Springer Actuarial

Editors-in-Chief
Hansjoerg Albrecher, Department of Actuarial Science, University of Lausanne, Lausanne, Switzerland
Michael Sherris, School of Risk & Actuarial, UNSW Australia, Sydney, NSW, Australia

Series Editors
Daniel Bauer, Wisconsin School of Business, University of Wisconsin-Madison, Madison, WI, USA
Stéphane Loisel, ISFA, Université Lyon 1, Lyon, France
Alexander J. McNeil, University of York, York, UK
Antoon Pelsser, Maastricht University, Maastricht, The Netherlands
Gordon Willmot, University of Waterloo, Waterloo, ON, Canada
Hailiang Yang, Department of Statistics & Actuarial Science, The University of Hong Kong, Hong Kong, Hong Kong

This is a series on actuarial topics in a broad and interdisciplinary sense, aimed at students, academics and practitioners in the fields of insurance and finance.

Springer Actuarial provides timely information on theoretical and practical aspects of topics such as risk management, internal models, solvency, asset-liability management, market-consistent valuation, the actuarial control cycle, insurance and financial mathematics, and other related interdisciplinary areas.

The series aims to serve as a primary scientific reference for education, research, development and model validation.

The type of material considered for publication includes lecture notes, monographs and textbooks. All submissions will be peer-reviewed.

Arthur Charpentier

Insurance, Biases, Discrimination and Fairness

Arthur Charpentier
Department of Mathematics, UQAM
Montreal, QC, Canada
ISSN 2523-3262          ISSN 2523-3270 (electronic)
Springer Actuarial
ISBN 978-3-031-49782-7          ISBN 978-3-031-49783-4 (eBook)
https://doi.org/10.1007/978-3-031-49783-4

This work was supported by Institut Louis Bachelier.

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2024

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Paper in this product is recyclable.
Preface
“14 litres d’encre de chine, 30 pinceaux, 62 crayons à mine grasse, 1 crayon à mine dure, 27 gommes à effacer, 38 kilos de papier, 16 rubans de machine à écrire, 2 machines à écrire, 67 litres de bière ont été nécessaires à la réalisation de cette aventure,” Goscinny and Uderzo (1965), Astérix et Cléopâtre.¹

Discrimination is a complicated concept. The most neutral definition is, according to Merriam-Webster (2022), simply “the act (or power) of distinguishing.” Amnesty International (2023) adds that it is an “unjustified distinction.” And indeed, most of the time, the word has a negative connotation, because discrimination is associated with some prejudice. An alternative definition, still according to Merriam-Webster (2022), is that discrimination is the “act of discriminating categorically rather than individually.” This corresponds to “statistical discrimination,” but also to actuarial pricing. Because actuaries do discriminate. As Lippert-Rasmussen (2017) clearly states, “insurance discrimination seems immune to some of the standard objections to discrimination.” Avraham (2017) goes further: “what is unique about insurance is that even statistical discrimination which by definition is absent of any malicious intentions, poses significant moral and legal challenges. Why? Because on the one hand, policy makers would like insurers to treat their insureds equally, without discriminating based on race, gender, age, or other characteristics, even if it makes statistical sense to discriminate (...) On the other hand, at the core of insurance business lies discrimination between risky and non-risky insureds. But riskiness often statistically correlates with the same characteristics policy makers would like to prohibit insurers from taking into account.” This is precisely the purpose of this book: to dig further into those issues, to understand the seeming oxymoron “fair discrimination” used in insurance, and to weave together the multiple perspectives that have been brought to bear on discrimination in insurance, linking a legal and a statistical view, an economic and an actuarial view, all in a context where computer scientists have also recently brought an enlightened eye to the question of the fairness of predictive models. Dealing with discrimination in insurance is probably an ill-defined, unsolvable problem, but it is important to understand why, in the current context of “big data” (which yields proxy variables that can capture information related to sensitive attributes) and “artificial intelligence” (or, more precisely, “machine-learning” techniques, and opaque, less interpretable, models).

¹ 14 liters of ink, 30 brushes, 62 grease pencils, 1 hard pencil, 27 erasers, 38 kilos of paper, 16 typewriter ribbons, 2 typewriters, 67 liters of beer were necessary to realize this adventure.
This book attempts to address questions such as: “is being color-blind or gender-blind sufficient (or necessary) to ensure that a model is fair?” “how can we assess whether a model is fair if we cannot collect sensitive information?” “is it fair that part of the health insurance premium paid by a man should be dedicated to covering the risk of becoming pregnant?” “is it fair to ask a smoker to pay more for his or her health insurance premium?” “is it fair to use a gender-neutral principle if we end up asking a higher premium of women?” “is it fair to use a discrimination-free model on biased data?” “is it fair to use a legitimate variable in a pricing model if it correlates strongly with a sensitive one?” Those are obviously old questions, raised when actuaries started to differentiate premiums. This book aims to be systematic, to connect the dots between communities that are too often distinct, bringing different perspectives on those important questions.
Before going further into the subject, a few words of thanks are in order. First, I want to thank Jean-Michel Beacco and Ryadh Benlahrech, of the Institut Louis Bachelier² (ILB) in Paris: an earlier version of this book was published in the Opinion and Debates series edited by the ILB. Second, I also want to thank Hansjörg Albrecher, Stéphane Loisel and Julien Trufin for their encouragement in publishing this book, while we were all together enjoying a workshop in Luminy (France), on machine learning for insurance, in the summer of 2022. I want to thank Caroline Hillairet and Christian-Yann Robert for giving me the opportunity to give a doctoral course on topics presented in this book. I should thank Laurence Barry, Michel Denuit, Jean-Michel Loubès, Marie-Pier Côté, as well as colleagues who attended recent seminars, for all the stimulating discussions we had over the past three years on those topics. I am extremely grateful to Ewen Gallic, Agathe Fernandes Machado, Antoine Ly, Olivier Côté, Philipp Ratz, and François Hu, who challenged me on some parts of the original manuscript and helped me to improve it (even if, of course, the errors that remain are entirely my fault, and my responsibility). I also want to thank the Chaire Thélem / ILB and the AXA Research Fund for financially supporting some of my recent research in this area, and Philippe Trainar and the SCOR Foundation for deciding that this was just the beginning, and that it was important to support research on all these topics over the coming years.
² The ILB (Institut Louis Bachelier) is a nonprofit organization created in 2008. Its activities are aimed at engaging academic researchers, as well as public authorities and private companies, in research projects in economics and finance, with a focus on four societal transitions: environmental, digital, demographic, and financial. The ILB is thus fully involved in the design of research programs and initiatives aimed at promoting sustainable development in economics and finance.
Finally, I want to apologize to my family, namely Hélène, Maël, Romane, and Fleur, for all the time spent on evenings, weekends, and holidays working on this book.
Montréal, QC, Canada
July 2023

Arthur Charpentier
Contents

1 Introduction
  1.1 A Brief Overview on Discrimination
    1.1.1 Discrimination?
    1.1.2 Legal Perspective on Discrimination
    1.1.3 Discrimination from a Philosophical Perspective
    1.1.4 From Discrimination to Fairness
    1.1.5 Economics Perspective on Efficient Discrimination
    1.1.6 Algorithmic Injustice and Fairness of Predictive Models
    1.1.7 Discrimination Mitigation and Affirmative Action
  1.2 From Words and Concepts to Mathematical Formalism
    1.2.1 Mathematical Formalism
    1.2.2 Legitimate Segmentation and Unfair Discrimination
  1.3 Structure of the Book
  1.4 Datasets and Case Studies

Part I Insurance and Predictive Modeling

2 Fundamentals of Actuarial Pricing
  2.1 Insurance
  2.2 Premiums and Benefits
  2.3 Premium and Fair Technical Price
    2.3.1 Case of a Homogeneous Population
    2.3.2 The Fear of Moral Hazard and Adverse-Selection
    2.3.3 Case of a Heterogeneous Population
  2.4 Mortality Tables and Life Insurance
    2.4.1 Gender Heterogeneity
    2.4.2 Health and Mortality
    2.4.3 Wealth and Mortality
  2.5 Modeling Uncertainty and Capturing Heterogeneity
    2.5.1 Groups of Predictive Factors
    2.5.2 Probabilistic Models
    2.5.3 Interpreting and Explaining Models
  2.6 From Technical to Commercial Premiums
    2.6.1 Homogeneous Policyholders
    2.6.2 Heterogeneous Policyholders
    2.6.3 Price Optimization and Discrimination
  2.7 Other Models in Insurance
    2.7.1 Claims Reserving and IBNR
    2.7.2 Fraud Detection
    2.7.3 Mortality
    2.7.4 Parametric Insurance
    2.7.5 Data and Models to Understand the Risks

3 Models: Overview on Predictive Models
  3.1 Predictive Model, Algorithms, and “Artificial Intelligence”
    3.1.1 Probabilities and Predictions
    3.1.2 Models
  3.2 From Categorical to Continuous Models
    3.2.1 Historical Perspective, Insurers as Clubs
    3.2.2 “Modern Insurance” and Categorization
    3.2.3 Mathematics of Rating Classes
    3.2.4 From Classes to Score
  3.3 Supervised Models and “Individual” Pricing
    3.3.1 Machine-Learning Terminology
    3.3.2 Generalized Linear Models
    3.3.3 Penalized Generalized Linear Models
    3.3.4 Neural Networks
    3.3.5 Trees and Forests
    3.3.6 Ensemble Approaches
    3.3.7 Application on the toydata2 Dataset
    3.3.8 Application on the GermanCredit Dataset
  3.4 Unsupervised Learning

4 Models: Interpretability, Accuracy, and Calibration
  4.1 Interpretability and Explainability
    4.1.1 Variable Importance
    4.1.2 Ceteris Paribus Profiles
    4.1.3 Breakdowns
    4.1.4 Shapley Value and Shapley Contributions
    4.1.5 Partial Dependence
    4.1.6 Application on the GermanCredit Dataset
    4.1.7 Application on the FrenchMotor Dataset
    4.1.8 Counterfactual Explanation
  4.2 Accuracy of Actuarial Models
    4.2.1 Accuracy and Scoring Rules
  4.3 Calibration of Predictive Models
    4.3.1 From Accuracy to Calibration
    4.3.2 Lorenz and Concentration Curves
    4.3.3 Calibration, Global, and Local Biases

Part II Data

5 What Data?
  5.1 Data (a Brief Introduction)
  5.2 Personal and Sensitive Data
    5.2.1 Personal and Nonpersonal Data
    5.2.2 Sensitive and Protected Data
    5.2.3 Sensitive Inferences
    5.2.4 Privacy
    5.2.5 Right to be Forgotten
  5.3 Internal and External Data
    5.3.1 Internal Data
    5.3.2 Connecting Internal and External Data
    5.3.3 External Data
  5.4 Typology of Ratemaking Variables
    5.4.1 Ratemaking Variables in Motor Insurance
    5.4.2 Criteria for Variable Selection
    5.4.3 An Actuarial Criterion
    5.4.4 An Operational Criterion
    5.4.5 A Criterion of Social Acceptability
    5.4.6 A Legal Criterion
  5.5 Behaviors and Experience Rating
  5.6 Omitted Variable Bias and Simpson’s Paradox
    5.6.1 Omitted Variable in a Linear Model
    5.6.2 School Admission and Affirmative Action
    5.6.3 Survival of the Sinking of the Titanic
    5.6.4 Simpson’s Paradox in Insurance
    5.6.5 Ecological Fallacy
  5.7 Self-Selection, Feedback Bias, and Goodhart’s Law
    5.7.1 Goodhart’s Law
    5.7.2 Other Biases and “Dark Data”

6 Some Examples of Discrimination
  6.1 Racial Discrimination
    6.1.1 A Sensitive Variable Difficult to Define
    6.1.2 Race and Risk
  6.2 Sex and Gender Discrimination
    6.2.1 Sex or Gender?
    6.2.2 Sex, Risk and Insurance
    6.2.3 The “Gender Directive”
  6.3 Age Discrimination
    6.3.1 Young or Old?
  6.4 Genetics versus Social Identity
    6.4.1 Genetics-Related Discrimination
    6.4.2 Social Identity
    6.4.3 “Lookism,” Obesity, and Discrimination
  6.5 Statistical Discrimination by Proxy
    6.5.1 Stereotypes and Generalization
    6.5.2 Generalization and Actuarial Science
    6.5.3 Massive Data and Proxy
  6.6 Names, Text, and Language
    6.6.1 Last Name and Origin or Gender
    6.6.2 First Name and Age or Gender
    6.6.3 Text and Biases
    6.6.4 Language and Voice
  6.7 Pictures
    6.7.1 Pictures and Facial Information
    6.7.2 Pictures of Houses
  6.8 Spatial Information
    6.8.1 Redlining
    6.8.2 Geography and Wealth
  6.9 Credit Scores
    6.9.1 Credit Scoring
    6.9.2 Discrimination Against the Poor
  6.10 Networks
    6.10.1 On the Use of Networks
    6.10.2 Mathematics of Networks, and Paradoxes

7 Observations or Experiments: Data in Insurance
  7.1 Correlation and Causation
    7.1.1 Correlation is (Probably) Not Causation
    7.1.2 Causality in a Dynamic Context
  7.2 Rung 1, Association (Seeing, “what if I see...”)
    7.2.1 Independence and Correlation
    7.2.2 Dependence with Graphs
  7.3 Rung 2, Intervention (Doing, “what if I do...”)
    7.3.1 The do() Operator and Computing Causal Effects
    7.3.2 Structural Causal Models
  7.4 Rung 3, Counterfactuals (Imagining, “what if I had done...”)
    7.4.1 Counterfactuals
    7.4.2 Weights and Importance Sampling
  7.5 Causal Techniques in Insurance

Part III Fairness

8 Group Fairness
  8.1 Fairness Through Unawareness
  8.2 Independence and Demographic Parity
  8.3 Separation and Equalized Odds
  8.4 Sufficiency and Calibration
  8.5 Comparisons and Impossibility Theorems
  8.6 Relaxation and Confidence Intervals
  8.7 Using Decomposition and Regressions
  8.8 Application on the GermanCredit Dataset
  8.9 Application on the FrenchMotor Dataset

9 Individual Fairness
  9.1 Similarity Between Individuals (and Lipschitz Property)
  9.2 Fairness with Causal Inference
  9.3 Counterfactuals and Optimal Transport
    9.3.1 Quantile-Based Transport
    9.3.2 Optimal Transport (Discrete Setting)
    9.3.3 Optimal Transport (General Setting)
    9.3.4 Optimal Transport Between Gaussian Distributions
    9.3.5 Transport and Causal Graphs
  9.4 Mutatis Mutandis Counterfactual Fairness
  9.5 Application on the toydata2 Dataset
  9.6 Application to the GermanCredit Dataset

Part IV Mitigation

10 Pre-processing
  10.1 Removing Sensitive Attributes
  10.2 Orthogonalization
    10.2.1 General Case
    10.2.2 Binary Sensitive Attribute
  10.3 Weights
  10.4 Application to toydata2
  10.5 Application to the GermanCredit Dataset

11 In-processing
  11.1 Adding a Group Discrimination Penalty
  11.2 Adding an Individual Discrimination Penalty
  11.3 Application on toydata2
    11.3.1 Demographic Parity
    11.3.2 Equalized Odds and Class Balance
  11.4 Application to the GermanCredit Dataset
    11.4.1 Demographic Parity
    11.4.2 Equalized Odds and Class Balance

12 Post-Processing
  12.1 Post-Processing for Binary Classifiers
  12.2 Weighted Averages of Outputs
  12.3 Average and Barycenters
  12.4 Application to toydata1
  12.5 Application on FrenchMotor
  12.6 Penalized Bagging

References
Index
Mathematical Notation

Acronyms

ACA      Affordable Care Act
CJEU     Court of Justice of the European Union
CNIL     (French) National Commission on Informatics and Liberty
DAG      directed acyclical graph
EIOPA    European Insurance and Occupational Pensions Authority
GAM      Generalized Additive Models
GDPR     General Data Protection Regulation
GLM      Generalized Linear Models
HOLC     Home Owners’ Loan Corporation
INED     (French) Institute for Demographic Studies
PAID     Prohibit Auto Insurance Discrimination

Mathematical Notations

X ⊥ Y            non-correlated variables, Cor[X, Y] = 0
⊥G               d-separability in causal inference
X ⊥⊥ Y           independence, P[{X ∈ A} ∩ {Y ∈ B}] = P[X ∈ A] · P[Y ∈ B], ∀A, B
X ⊥⊥ Y | Z       conditional independence
X =L Y           equal in distribution (probabilistic statement)
Xn →L Y          convergence in distribution (probabilistic statement)
Xn ∼L Y          almost equal in distribution (statistical statement)
Ā                complement of a set A in Ω, Ā = Ω \ A
ȳ                empirical average from a sample {y1, ..., yn}
‖x1 − x2‖        norm (classically the ℓ2-norm)
x⊥               transformation of a vector x, orthogonal to s
A⊤               transpose of a matrix A
∝                proportionality sign
1                indicator, 1_A(x) = 1(x ∈ A) = 1 if x ∈ A, 0 otherwise; or vector of ones, 1 = (1, 1, ..., 1) ∈ Rn, in linear algebra
I                identity matrix, square matrix with 1 entries on the diagonal, 0 elsewhere
{A, B}           values taken by a binary sensitive attribute s
A                adjacency matrix
a_j(x_j*)        accumulated local function, for variable j at location x*
ACC              accuracy, (TP+TN)/(P+N)
argmax           arguments (of the maxima)
ATE              average treatment effect
AUC              area under the (ROC) curve
B(n, p)          binomial distribution (or Bernoulli B(p))
C                ROC curve, t ↦ TPR ∘ FPR⁻¹(t)
C̄                convex hull of the ROC curve
CATE             conditional average treatment effect
Cor[X, Y], r     Pearson’s correlation, r = Cor[X, Y] = Cov[X, Y] / √(Var[X] · Var[Y])
Cov[X, Y]        covariance, Cov[X, Y] = E[(X − E[X])(Y − E[Y])]
d(x1, x2)        distance between two points in X
D(p1, p2)        divergence
d                vector of degrees, in a network
D                training dataset, or Dn
Δ_{j|S}(x*)      contribution of the j-th variable, at x*, conditional on a subset of variables, S ⊂ {1, ..., k}\{j}
do(X = x)        do operator (for an intervention in causal inference)
E[X]             expected value (under probability P)
f                density associated with cumulative distribution function F
F                cumulative distribution function, with respect to probability P
F⁻¹              (generalized) quantile, F⁻¹(p) = inf{x ∈ R : p ≤ F(x)}
FN               false negative (from confusion matrix)
FNR              false-negative rate, also called miss rate
FP               false positive (from confusion matrix)
FPR              false-positive rate, also called fall-out
FPR(t)           function [0, 1] → [0, 1], FPR(t) = P[m(X) > t | Y = 0]
γ                Gini mean difference, E|Y − Y′| where Y, Y′ ∼ F are independent copies
γ^bd_j(x*)       breakdown contribution of the j-th variable, for individual x*
γ^shap_j(x*)     Shapley contribution of the j-th variable, for individual x*
G                Gini index
G                some graph, with nodes and edges
GUI              group fairness index
i                index for individuals (usually, i ∈ {1, ..., n})
j                index for variables (usually, j ∈ {0, 1, ..., k})
k                number of features used in a model
L                Lorenz curve
L                likelihood, L(θ; y)
ℓ(y, ŷ)          loss function
ℓ1               absolute deviation loss function, ℓ1(y, ŷ) = |y − ŷ|
ℓ2               quadratic loss function, ℓ2(y, ŷ) = (y − ŷ)²
ℓ_j(x_j*)        local dependence plot, for variable j at location x*
λ                tuning parameter associated with a penalty in an optimization problem (Lagrangian)
log              natural logarithm (with log(eˣ) = x, ∀x ∈ R)
m(z)             predictive model, m : Z → Y, possibly a score in [0, 1]
m̂(z)             fitted predictive model from data D (collection of (yi, zi)’s)
m_t(z)           classifier based on model m(·) and threshold t ∈ (0, 1), m_t : Z → {0, 1}, m_t = 1(m > t)
m_{x*,j}(z)      ceteris paribus profile
M                set of possible predictive models
μ(x)             regression function, E[Y | X = x]
n                number of observations in the training sample
n_A, n_B         number of observations in the training sample, respectively with s = A and s = B
N                set of natural numbers, or non-negative integers (0 ∈ N)
N                normal (Gaussian) distribution, N(μ, σ²) or N(μ, Σ)
P                true probability measure
p                probability, p ∈ [0, 1]
p_j(x_j*)        partial dependence plot, for variable j at location x*
P(λ)             Poisson distribution, with average λ
PDP              partial dependence plot
PPV              precision or positive predictive value (from confusion matrix), TP/(TP+FP)
Π_Z              orthogonal projection matrix, Π_Z = Z(Z⊤Z)⁻¹Z⊤
Π(p, q)          set of multivariate distributions with “margins” p and q
Q                some probability measure
r(X, Y)          Pearson’s correlation
r*(X, Y)         maximal correlation
R                set of real numbers
R^d              standard vector space of dimension d
R^(n0×n1)        set of real-valued n0 × n1 matrices
R(m)             risk of a model m, associated with loss ℓ
R_n(m)           empirical risk of a model m, for a sample of size n
S, s             sensitive attribute
s                collection of sensitive attributes, in {0, 1}n
s, S             scoring rule, and expected scoring rule (in Sect. 4.2)
S                usually {A, B} in the book, or {0, 1}, so that s ∈ S
S_d              standard probability simplex (S_d ⊂ R^d)
T                treatment variable in causal inference
T                transport / coupling mapping, X → X or Y → Y
T#               push-forward operator, P1(A) = T#P0(A) = P0(T⁻¹(A))
t                threshold, cut-off for a classifier, ŷ = 1(m(x) > t)
TN               true negative (from confusion matrix)
TNR              true-negative rate, also called specificity or selectivity, TN/(TN+FP)
TP               true positive (from confusion matrix)
TPR              true-positive rate, also called sensitivity or recall, TP/(TP+FN)
TPR(t)           function [0, 1] → [0, 1], TPR(t) = P[m(X) > t | Y = 1]
Θ, θ             latent unobservable risk factor (θ for multivariate latent factors) or unknown parameter in a parametric model
u                utility function
U                uniform distribution
U(a0, a1)        set of matrices {M ∈ R+^(n0×n1) : M 1_{n1} = a0 and M⊤ 1_{n0} = a1}
U_{n0,n1}        set of matrices U(n0⁻¹ 1_{n0}, n1⁻¹ 1_{n1})
V                some value function on a subset of indices, in {1, ..., k}
Var[X]           variance, Var[X] = E[(X − E[X])²], or covariance matrix, Var[X] = E[(X − E[X])(X − E[X])⊤]
W                Wasserstein distance (W2 if no index is specified)
w, ω             weight (ω ≥ 0) or wealth (in the economic model)
Ω                theoretical sample space associated with a probabilistic space
Ω                weight matrix
x, x_i           collection of explanatory variables for a single individual, in X ⊂ Rk
x_j              collection of observations, for variable j
X                subset of Rk, so that x ∈ X = X1 × ··· × Xk
Y, y             variable of interest
Y_{T←t}          potential outcome of y if treatment T had taken value t
y                collection of observations, in Yn
ŷ                prediction of the variable of interest
Y                subset of R, so that y ∈ Y but also ŷ, m(x) ∈ Y
Z, z             information, z = (x, s), including legitimate and protected features
z                collection of observations z = (x, s), in Z
Z                set X × S

The following convention is used in the textbook:

x     value taken by a random variable, or single number (lower case, italics)
X     random variable (capital, italics)
x     vector, or collection of numerical values (lower case, italics, bold)
X     random vector or matrix (capital, italics, bold)
X     set, values taken by X (calligraphic)
x_j   value taken by a random variable corresponding to the j-th variable (from x)
x_i   vector, collection of numerical values for individual i in the dataset
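To fix ideas on the confusion-matrix notation above (TPR, FPR, the ROC curve C : t ↦ TPR ∘ FPR⁻¹(t), and the AUC), the following minimal Python sketch computes empirical versions of these quantities on simulated data; the simulation and all variable names are illustrative only, and are not taken from the book.

import numpy as np

def tpr_fpr(y, score, t):
    # empirical TPR(t) = P[m(X) > t | Y = 1] and FPR(t) = P[m(X) > t | Y = 0]
    y, score = np.asarray(y), np.asarray(score)
    return np.mean(score[y == 1] > t), np.mean(score[y == 0] > t)

def roc_auc(y, score, n_grid=101):
    # ROC curve as the parametric curve t -> (FPR(t), TPR(t)),
    # with the AUC computed by the trapezoidal rule
    ts = np.linspace(0, 1, n_grid)
    rates = np.array([tpr_fpr(y, score, t) for t in ts])
    tpr, fpr = rates[:, 0], rates[:, 1]
    order = np.argsort(fpr)
    return fpr, tpr, np.trapz(tpr[order], fpr[order])

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.3, size=1_000)                    # outcome Y in {0, 1}
score = np.clip(rng.normal(0.25 + 0.3 * y, 0.2), 0, 1)  # score m(x) in [0, 1]
fpr, tpr, auc = roc_auc(y, score)
print(f"AUC = {auc:.3f}")   # well above 0.5, since the score is informative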
Chapter 1

Introduction
Abstract Although the algorithms of machine-learning methods have brought issues of discrimination and fairness back to the forefront, these topics have been the subject of an extensive body of literature over the past decades. But dealing with discrimination in insurance is fundamentally an ill-defined, unsolvable problem. Nevertheless, we try to connect the dots, to explain different perspectives, going back to the legal, philosophical, and economic approaches to discrimination, before discussing the so-called concept of “actuarial fairness.” We offer some definitions, an overview of the book, as well as the datasets used in the illustrative examples throughout the chapters.
1.1 A Brief Overview on Discrimination
1.1.1 Discrimination?
Definition 1.1 (Discrimination (Merriam-Webster 2022)) Discrimination is the act, practice, or an instance of separating or distinguishing categorically rather than individually.
In this book, we use this neutral definition of “discrimination.” Nevertheless, Kroll et al. (2017) remind us that the word “discrimination” carries a very different meaning in statistics and computer science than it does in public policy: “Among computer scientists, the word is a value-neutral synonym for differentiation or classification: a computer scientist might ask, for example, how well a facial recognition algorithm successfully discriminates between human faces and inanimate objects. But, for policymakers, ‘discrimination’ is most often a term of art for invidious, unacceptable distinctions among people: distinctions that either are, or reasonably might be, morally or legally prohibited.” The word discrimination can then be used both in a purely descriptive sense (in the sense of making distinctions, as in this book), or in a normative manner, which implies that the differential treatment of certain groups is morally wrong, as shown by Alexander (1992), or more recently Loi and Christen (2021). To emphasize the second meaning, we can prefer the word “prejudice,” which refers to an “unjustifiable negative attitude” (Dambrum et al. 2003; Al Ramiah et al. 2010) or an “irrational attitude of hostility” (Merriam-Webster 2022) toward a group and its individual members.
Definition 1.2 (Prejudice (Merriam-Webster 2022)) Prejudice is (1) preconceived judgment or opinion, or an adverse opinion or leaning formed without just grounds or before sufficient knowledge; (2) an instance of such judgment or opinion; (3) an irrational attitude of hostility directed against an individual, a group, a race, or their supposed characteristics.
The definition of “discrimination” given in Correll et al. (2010) can be related to the latter one: “behaviour directed towards category members that is consequential for their outcomes and that is directed towards them not because of any particular deservingness or reciprocity, but simply because they happen to be members of that category.” Here, the idea of an “unjustified” difference is mentioned. But what if the difference can somehow be justified? The notion of “merit” is key to the expression and experience of discrimination (we discuss this in relation to ethics later). It is not an objectively defined criterion, but one rooted in historical and current societal norms and inequalities.
Avraham (2017) explained in one short paragraph the dilemma of considering the problem of discrimination in insurance: “What is unique about insurance is that even statistical discrimination which by definition is absent of any malicious intentions, poses significant moral and legal challenges. Why? Because on the one hand, policy makers would like insurers to treat their insureds equally, without discriminating based on race, gender, age, or other characteristics, even if it makes statistical sense to discriminate (...) On the other hand, at the core of insurance business lies discrimination between risky and non-risky insureds. But riskiness often statistically correlates with the same characteristics policy makers would like to prohibit insurers from taking into account.” To illustrate this problem, and why writing about discrimination and insurance can be complicated, let us consider the example of “redlining.” Redlining, which we discuss further in Sect. 6.1.2, has been an important issue for the credit and insurance industries in the USA since the 1930s. In 1935, the Federal Home Loan Bank Board (FHLBB) looked at more than 200 cities and created “residential security maps” to indicate the level of security for real-estate investments in each surveyed city. On the maps (see Fig. 1.1 with a collection of fictitious maps), the newest areas, those considered desirable for lending purposes, were outlined in green and known as “Type A.” “Type D” neighborhoods were outlined in red and were considered the most risky for mortgage support (on the left of Fig. 1.1). Those areas were indeed those with a high proportion of dilapidated buildings, or buildings in disrepair (as we can observe in the middle of Fig. 1.1). This corresponds to the first definition of “redline.”
Definition 1.3 (Redline (Merriam-Webster 2022)) To redline is (1) to withhold home-loan funds or insurance from neighborhoods considered poor economic risks; (2) to discriminate against in housing or insurance.
Fig. 1.1 Map (freely) inspired by a Home Owners’ Loan Corporation map from 1937, where red is used to identify neighborhoods where investment and lending were discouraged, on the left-hand side (see Crossney 2016 and Rhynhart 2020). In the middle, some risk-related variable (a fictitious “unsanitary index”) per neighborhood of the city is presented, and on the right-hand side, a sensitive variable (the proportion of Black people in the neighborhood). Those maps are fictitious (see Charpentier et al. 2023b)
In the 1970s, when looking at census data, sociologists noticed that red areas, where insurers did not want to offer coverage, were also those with a high proportion of Black people, and following the work of John McKnight and Andrew Gordon, “redlining” received more interest. On the map on the right, we can observe information about the proportion of Black people. Thus, on the one hand, it could be seen as “legitimate” to have a premium for households that somehow reflects the general condition of the houses. On the other hand, it would be discriminatory to have a premium that is a function of the ethnic origin of the policyholder. Here, the neighborhood, the “unsanitary index,” and the proportion of Black people are strongly correlated variables. Of course, there could be non-Black people living in dilapidated houses outside of the red area, Black people living in wealthy houses inside the red area, etc. If we work using aggregated data, it is difficult to disentangle information about sanitary conditions and racial information, and to distinguish “legitimate” and “nonlegitimate” discrimination, as discussed in Hellman (2011). Note that, within the context of “redlining,” the use of census and aggregated data may introduce the potential for an “ecological fallacy” (as discussed in King et al. (2004) or Gelman (2009)). In the 2020s, we now have much more information (the so-called “big data” era) and more complex models (from the machine-learning literature), and we will see how to disentangle this complex problem, even if dealing with discrimination in insurance is probably still an ill-defined, unsolvable problem, with strong identification issues. Nevertheless, as we will see, there are many ways of looking at this problem, and we try, here, to connect the dots, to explain different perspectives.
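To make this identification issue concrete, here is a small simulated sketch in Python, in the spirit of the fictitious maps of Fig. 1.1 (all names, parameters, and the data-generating process are invented for illustration). Claims depend only on the “legitimate” unsanitary index, yet a premium fitted on that index alone still differs, on average, between groups, because the index acts as a proxy for the sensitive attribute:

import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# sensitive attribute S (1 = member of the protected group), unevenly
# distributed across the ten neighborhoods of a fictitious city
neighborhood = rng.integers(0, 10, size=n)
p_s = np.linspace(0.05, 0.75, 10)[neighborhood]   # group proportion per area
s = rng.binomial(1, p_s)

# "legitimate" risk variable: an unsanitary index, higher in the same areas,
# hence strongly correlated with S
unsanitary = np.linspace(0.2, 0.8, 10)[neighborhood] + rng.normal(0, 0.1, n)

# claim counts depend on the unsanitary index only (no direct effect of S)
claims = rng.poisson(np.exp(-1 + 2 * unsanitary))

# a crude premium, linear in the legitimate variable only...
slope, intercept = np.polyfit(unsanitary, claims, 1)
premium = intercept + slope * unsanitary

# ...still differs between groups, through the proxy
print("Cor(unsanitary, S) :", round(np.corrcoef(unsanitary, s)[0, 1], 2))
print("mean premium, S = 0:", round(premium[s == 0].mean(), 2))
print("mean premium, S = 1:", round(premium[s == 1].mean(), 2))

With individual-level data, at least the correlation between the index and the sensitive attribute can be measured; with neighborhood-level aggregates only, as in the historical redlining maps, the two effects cannot be separated, which is one facet of the ecological fallacy mentioned above.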
1.1.2 Legal Perspective on Discrimination
More than 100 years ago, a law was passed in Kansas allowing an insurance commissioner to review rates to ensure that they were not “excessive, inadequate, or unfairly discriminatory with regards to individuals,” as mentioned in Powell (2020). Since then, the idea of “unfairly discriminatory” insurance rates has been discussed in many states in the USA (see Box 1.1).
Box 1.1 “Unfairly discriminatory” insurance rates, according to US legislation

Arkansas law (23/3/67/2/23-67-208), 1987: “A rate is not unfairly discriminatory in relation to another in the same class of business if it reflects equitably the differences in expected losses and expenses. Rates are not unfairly discriminatory because different premiums result for policyholders with like loss exposures but different expense factors, or with like expense factors but different loss exposures, if the rates reflect the differences with reasonable accuracy (...) A rate shall be deemed unfairly discriminatory as to a risk or group of risks if the application of premium discounts, credits, or surcharges among the risks does not bear a reasonable relationship to the expected loss and expense experience among the various risks.”

Maine Insurance Code (24-A, 2303), 1969: “Risks may be grouped by classifications for the establishment of rates and minimum premiums. Classification rates may be modified to produce rates for individual risks in accordance with rating plans that establish standards for measuring variations in hazards or expense provisions, or both. These standards may measure any differences among risks that may have a probable effect upon losses or expenses. No risk classification may be based upon race, creed, national origin or the religion of the insured (...) Nothing in this section shall be taken to prohibit as unreasonable or unfairly discriminatory the establishment of classifications or modifications of classifications or risks based upon size, expense, management, individual experience, purpose of insurance, location or dispersion of hazard, or any other reasonable considerations, provided such classifications and modifications apply to all risks under the same or substantially similar circumstances or conditions.”
Unfortunately, as recalled in Vandenhole (2005), there is “no universally accepted definition of discrimination,” and most legal documents usually provide (non-exhaustive) lists of the grounds on which discrimination is to be prohibited. For example, in the International Covenant on Civil and Political Rights, “the law shall prohibit any discrimination and guarantee to all persons equal and effective protection against discrimination on any ground such as race, color, sex, language, religion, political or other opinion, national or social origin, property, birth or other status” (see Joseph and Castan 2013). Such lists do not really address the question of what discrimination is. But looking for common features among those grounds can help explain what discrimination is. For instance, discrimination is necessarily oriented toward some people based on their membership of a certain type of social group, with reference to a comparison group. Hence, our discourse should not center around the absolute assessment of how effectively an individual within a specific group is treated, but rather on the comparison of the treatment that an individual receives relative to someone who could be perceived as “similar” within the reference group. Furthermore, the significance of this reference group is paramount, as discrimination does not merely entail disparate treatment: it necessitates the presence of a favored group and a disfavored group, thus characterizing a fundamentally asymmetrical dynamic. As Altman (2011) wrote, “as a reasonable first approximation, we can say that discrimination consists of acts, practices, or policies that impose a relative disadvantage on persons based on their membership in a salient social group.”
1.1.3 Discrimination from a Philosophical Perspective
As mentioned already, we should not expect to have universal rules about discrimination. For instance, Supreme Court Justice Thurgood Marshall once claimed that “a sign that says ‘men only’ looks very different on a bathroom door than on a courthouse door,” as reported in Hellman (2011). Nevertheless, philosophers have suggested definitions, starting with a distinction between “direct” and “indirect” discrimination. As mentioned in Lippert-Rasmussen (2014), it would be too simple to equate direct discrimination with intentional discrimination. A classic example would be a paternalistic employer who intends to help women by hiring them only for certain jobs, or for a promotion, as discussed in Jost et al. (2009). In that case, acts of direct discrimination can be unconscious, in the sense that agents are unaware of the discriminatory motive behind decisions (related to the “implicit bias” discussed in Brownstein and Saul (2016a,b)). Indirect discrimination corresponds to decisions with disproportionate effects, which might be seen as discriminatory even if that is not the objective of the decision process mechanism. A standard example could be the one where the only way to enter a public building is by a set of stairs, which could be seen as discrimination against people with disabilities who use wheelchairs, as they would be unable to enter the building; or a minimum height requirement for a job where height is not relevant, which could be seen as discrimination against women, as they are generally shorter than men. On the one hand, for Young (1990), Cavanagh (2002), or Eidelson (2015), indirect discrimination should not be considered discrimination, which should be strictly limited to “intentional and explicitly formulated policies of exclusion or preference.” For Cavanagh (2002), in many cases, “it is not discrimination they object to, but its effects; and these effects can equally be brought about by other causes.” On the other hand, Rawls (1971) considered structural indirect discrimination, that is, when the rules and norms of society consistently produce disproportionately disadvantageous outcomes for the members of a certain group, relative to the other groups in society. Even if it is not intentional, it should be considered discriminatory.
Let us get back to the moral grounds, to examine why discrimination is considered wrong. According to Kahlenberg (1996), racial discrimination should be considered “unfair” because it is associated with an immutable trait. Unfortunately, Boxill (1992) recalls that with such a definition, it would also be unfair to deny blind people a driver’s license. And religion challenges most definitions, as it is neither an immutable trait nor a form of disability. Another approach is to claim that discrimination is wrong because it treats persons on the basis of inaccurate generalizations and stereotypes, as suggested by Schauer (2006). For Kekes (1995), treating a person a certain way only because she is a member of a certain social group is inherently unfair, as stereotyping treats people unequally “without rational justification.” Thus, according to Flew (1993), racism is unfair because it treats individuals on the basis of traits that “are strictly superficial and properly irrelevant to all, or almost all, questions of social status and employability.” In other words, discrimination is perceived as wrong because it fails to treat individuals based on their merits. But in that case, as Cavanagh (2002) observed, “hiring on merit has more to do with efficiency than fairness,” which we discuss further in the next section, on the economic foundations of discrimination. Finally, Lippert-Rasmussen (2006) and Arneson (1999, 2013) suggested looking at discrimination through some consequentialist moral theory. In this approach, discrimination is wrong because it violates a rule that would be part of the social morality that maximizes overall moral value. Arneson (2013) writes that this view “can possibly defend nondiscrimination and equal opportunity norms as part of the best consequentialist public morality.”
A classical philosophical notion close to the idea of “nondiscrimination” is the concept of “equality of opportunity” (EOP). For Roemer and Trannoy (2016), “equality of opportunity” is a political ideal that is opposed to assigned-at-birth (caste) hierarchy, but not to hierarchy itself. To illustrate this point, consider the extreme case of caste hierarchy, where children acquire the social status of their parents. In contrast, “equality of opportunity” demands that the social hierarchy be determined by a form of equal competition among all members of the society. Rawls (1971) uses “equality of opportunity” to address the discrimination problem: everyone should be given a fair chance at success in a competition. This is also called “substantive equality of opportunity,” and it is often implemented through metrics such as statistical parity and equalized odds, which assume that talent and motivation are equally distributed among sub-populations. This concept can be distinguished from the luck-egalitarian conception of equality of opportunity, as defined in Segall (2013), where a person’s outcome should be affected only by their choices, not their circumstances.
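Statistical parity and equalized odds are studied formally in Chap. 8; as a preview, here is a minimal Python sketch of both group-fairness metrics, under the book’s convention s ∈ {A, B}, where the data and the (deliberately unfair) classifier are invented for illustration:

import numpy as np

def demographic_parity_gap(yhat, s):
    # statistical parity: compare P[Yhat = 1 | S = A] with P[Yhat = 1 | S = B]
    yhat, s = np.asarray(yhat), np.asarray(s)
    return abs(yhat[s == "A"].mean() - yhat[s == "B"].mean())

def equalized_odds_gap(y, yhat, s):
    # equalized odds: TPR and FPR should coincide across the two groups
    y, yhat, s = map(np.asarray, (y, yhat, s))
    gaps = []
    for y0 in (0, 1):   # y0 = 0 compares FPRs, y0 = 1 compares TPRs
        rate_a = yhat[(s == "A") & (y == y0)].mean()
        rate_b = yhat[(s == "B") & (y == y0)].mean()
        gaps.append(abs(rate_a - rate_b))
    return max(gaps)

rng = np.random.default_rng(1)
s = rng.choice(["A", "B"], size=5_000)
y = rng.binomial(1, np.where(s == "A", 0.4, 0.3))       # outcomes
yhat = rng.binomial(1, np.where(s == "A", 0.5, 0.25))   # biased predictions
print(demographic_parity_gap(yhat, s), equalized_odds_gap(y, yhat, s))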
1.1.4 From Discrimination to Fairness
Humans have an innate sense of fairness and justice: studies show that even 3-year-old children are able to consider merit when sharing rewards, as shown by Kanngiesser and Warneken (2012), and similar behavior has been observed in chimpanzees and other primates (Brosnan 2006), as well as in many other animal species. Yet even if this sense is largely innate, it remains difficult to define what is “fair,” although many scientists have attempted to define notions of “fair” sharing, as Brams et al. (1996) recalls. In one sense, “fair” refers to legality (and to human justice, translated into a set of laws and regulations); in a second sense, “fair” refers to an ethical or moral concept (and to an idea of natural justice). The second reading of the word “fairness” is the most important here. According to one dictionary, fairness “consists in attributing to each person what is due to him by reference to the principles of natural justice.” And being “just” raises questions related to ethics and morality (we do not differentiate here between ethics and morality).
This has to be related to a concept introduced in Feinberg (1970), called “desert theory,” corresponding to the moral idea that good actions should lead to better outcomes. A student deserves a good grade by virtue of having written a good paper; the victim of an industrial accident deserves substantial compensation owing to the negligence of his or her employer. For Leibniz or Kant, a person is supposed to deserve happiness in virtue of being morally good. In Feinberg (1970)’s approach, “deserts” are often seen as positive, but they can also be negative, such as fines, dishonor, sanctions, condemnations, etc. (see Feldman (1995), Arneson (2007) or Haas (2013)). The concept of “desert” generally consists of a relationship among three elements: an agent, a deserved treatment or good, and the basis on which the agent is deserving.
We evoke in this book the “ethics of models” or, as coined by Mittelstadt et al. (2016) or Tsamados et al. (2021), the “ethics of algorithms.” A nuance exists with respect to the “ethics of artificial intelligence,” which deals with our behavior or choices (as human beings) in relation to, for example, autonomous cars, and which attempts to answer questions such as “should a technology be adopted if it is more efficient?” The ethics of algorithms questions the choices made “by the machine” (even if they often reflect choices, or objectives, imposed by the person who programmed the algorithm), or by humans when their choices are guided by some algorithm.
Programming an algorithm in an ethical way must be done according to a certain number of standards. Two types of norms are generally considered by philosophers. The first is related to conventions, i.e., the rules of the game (chess or Go), or the rules of the road (for autonomous cars). The second is made up of moral norms, which must be respected by everyone and are aimed at the general interest. These norms must be universal, and therefore not favor any individual, or any group of individuals. This universality is fundamental for Singer (2011), who asks us not to judge a situation from our own perspective, or that of a group to which we belong, but to take a “neutral” and “fair” point of view.
As discussed previously, the ethical analyses of discrimination are related to the concept of “equality of opportunity,” which holds that the social status of individuals should depend solely on the service that they can provide to society. As the second sentence of Article 1 of the 1789 Declaration of the Rights of Man states, “les distinctions sociales ne peuvent être fondées que sur l’utilité commune” (translated as1 “social distinctions may be founded only upon the general good”) or, as Rawls (1971) points out, “offhand it is not clear what is meant, but we might say that those with similar abilities and skills should have similar life chances. More specifically, assuming that there is a distribution of natural assets, those who are at the same level of talent and ability, and have the same willingness to use them, should have the same prospects of success regardless of their initial place in the social system, that is, irrespective of the income class into which they are born.” In the deontological approach, inspired by Immanuel Kant, one forgets the utilities of each person and simply imposes norms and duties: here, regardless of the consequences (for the community as a whole), some things are simply not to be done. A distinction is typically made between egalitarian and proportionalist approaches. To go further, Roemer (1996, 1998) proposes a philosophical approach, whereas Fleurbaey and Maniquet (1996) and Moulin (2004) consider an economic vision. And in a more computational context, Leben (2020) goes back to normative principles to assess the fairness of a model.
All ethics courses feature thought experiments, such as the popular “streetcar dilemma.” In the original problem, stated in Foot (1967), a tram with no brakes is about to run over five people, and a bystander has the opportunity to flip a switch that will cause the tram to swerve, killing one person instead. What do we do? Or what should we do? Thomson (1976) suggested a different version, with a footbridge, from which you can push a heavy person onto the track; he will die, but his body will stop the tram. The latter version is often found more disturbing because the harm is inflicted directly: you begin by killing someone in order to save the others. Some authors have used this thought experiment to distinguish between explanation (on scientific grounds, and based on causal arguments) and justification (based on moral precepts). This tramway experiment has been taken up in a moral psychology experiment called the Moral Machine project.2 In this “game,” participants were virtually behind the wheel of a car, and choices were proposed: “Do you run over one person or five people?”, “Do you run over an elderly person or a child?”, “Do you run over a man or a woman?” Bonnefon (2019) revisits the experiment and the series of moral dilemmas, for which more than 40 million answers were obtained, from 130 countries. Naturally, the number of victims was an important feature (we prefer to kill fewer people), but age also mattered greatly (priority given to young people), and legal arguments seemed to emerge (we prefer to kill pedestrians who cross
1 See https://avalon.law.yale.edu/18th_century/rightsof.asp.
2 See https://www.moralmachine.net/.
outside the dedicated crossings). These questions are important for self-driving cars, as mentioned by Thornton et al. (2016).
For a philosopher, the question “How fair is this model to this group?” will always be followed by “Fair by what normative principle?” Measuring the overall effects on all those affected by the model (and not just the rights of a few) leads to incorporating measures of fairness into an overall calculation of social costs and benefits. If we choose one approach, others will suffer. But this is the nature of moral choices, and the only responsible way to mitigate negative headlines is to develop a coherent response to these dilemmas, rather than ignore them. To speak of the ethics of models poses philosophical questions from which we cannot free ourselves because, as we have said, a model aims at representing reality, “what is.” To fight against discrimination, or to invoke notions of fairness, is to talk about “what should be.” We are once again faced with the famous opposition of Hume (1739). This is a well-known property of statistical models, as well as of machine-learning ones. As Chollet (2021) wrote: “Keep in mind that machine learning can only be used to memorize patterns that are present in your training data. You can only recognize what you’ve seen before. Using machine learning trained on past data to predict the future is making the assumption that the future will behave like the past.”

When we speak of a “norm,” it is important not to confuse the descriptive and the normative or, in other words, statistics (which tells us how things are) and ethics (which tells us how things should be). Statistical law is about “what is” because it has been observed to be so (e.g., humans are bigger than dogs). Human (divine, or judicial) law pertains to what ought to be because it has been decreed (e.g., humans are free and equal, or humans are good). One can see the “norm” as a regularity of cases, observed with the help of frequencies (or averages, as mentioned in the next chapter), for example, on the height of individuals or the length of sleep, in other words, data that make up the description of individuals. Anthropometric data have thus made it possible to define, for example, an average height of individuals in a given population, according to their age; in relation to this average height, a deviation of 20% more or less determines gigantism or dwarfism.

If we think of road accidents, it may be considered “abnormal” to have a road accident in a given year, at an individual (micro) level, because the majority of drivers do not have an accident. However, from the insurer’s perspective (macro), the norm is that 10% of drivers have an accident. It would therefore be abnormal for no one to have an accident. This is the argument found in Durkheim (1897). Starting from the singular act of suicide, considered from the point of view of the individual who commits it, Durkheim tries to see it as a social act, falling within a real norm of a given society. From then on, suicide becomes, according to Durkheim, a “normal” phenomenon. Statistics then make it possible to quantify the tendency to commit suicide in a given society, as soon as one no longer observes the irregularity that appears in the singularity of an individual history, but a “social normality” of suicide. Abnormality is then defined as “contrary to the usual order of things” (an empirical, statistical notion), or “contrary to the right order of things” (where “right” probably implies a normative definition), or also as not conforming to the model.
Defining a norm is not straightforward if we are only interested in the descriptive, empirical aspect, as actuaries do when they develop a model; when a dimension of justice and ethics is also added, the complexity is bound to increase. We shall return in Chap. 4 to the (mathematical) properties that a “fair” or “equitable” model should satisfy. If we ask a model to verify criteria that are not necessarily observed in the data, it is necessary to integrate a specific constraint into the model-learning algorithm, with a penalty related to a fairness measure (just as we use a “model complexity measure” to avoid overfitting), as in the sketch below.
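As a minimal illustration (a hedged sketch in R, not code from the book’s packages; the function fair_logit and the weight lambda are illustrative names), such a penalty can be grafted onto a logistic regression by adding, to the usual negative log-likelihood, a term proportional to the gap between average scores in the two groups:

fair_logit <- function(x, y, s, lambda = 1) {
  obj <- function(beta) {
    p <- plogis(beta[1] + beta[2] * x)                 # predicted scores
    p <- pmin(pmax(p, 1e-8), 1 - 1e-8)                 # numerical safeguard
    nll <- -sum(y * log(p) + (1 - y) * log(1 - p))     # logistic deviance
    gap <- abs(mean(p[s == "A"]) - mean(p[s == "B"]))  # fairness measure
    nll + lambda * length(y) * gap                     # penalized objective
  }
  optim(c(0, 0), obj)$par                              # (intercept, slope)
}

With lambda = 0, this reduces to ordinary logistic regression; increasing lambda trades goodness of fit for a smaller gap between the groups’ average scores. Formal fairness criteria and penalties of this kind are the topic of Parts III and IV.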
1.1.5 Economics Perspective on Efficient Discrimination
Where jurists use the term “rational discrimination,” economists use the term “efficient” or “statistical” discrimination, as in Phelps (1972) or Arrow (1973), following early work by Edgeworth (1922). Following Becker (1957), economists have tended to define discrimination as a situation where people who are “the same” (with respect to legitimate covariates) are treated differently. Hence, “discrimination” corresponds here to some “disparity,” but we will frequently use the term “discrimination.” More precisely, it is necessary to distinguish two standards. The first, “disparate treatment,” holds that “any economic agent who applies different rules to people in protected groups is practicing discrimination,” as defined in Yinger (1998). The second standard is “disparate impact”: practices that seem to be neutral, but have the effect of disadvantaging one group more than others.
Definition 1.4 (Disparate Treatment (Merriam-Webster 2022)) Disparate treatment corresponds to the treatment of an individual (as an employee or prospective juror) that is less favorable than treatment of others for discriminatory reasons (such as race, religion, national origin, sex, or disability).
Definition 1.5 (Disparate Impact (Merriam-Webster 2022)) Disparate impact corresponds to an unnecessary discriminatory effect on a protected class caused by a practice or policy (as in employment or housing) that appears to be nondiscriminatory.
In labor economics, wages should be a function of productivity, which is unobservable when a contract is signed; therefore, as discussed in Riley (1975), Kohlleppel (1983) or Quinzii and Rochet (1985), employers try to find signals. As claimed in Lippert-Rasmussen (2013), statistical discrimination occurs when “there is statistical evidence which suggests that a certain group of people differs from other groups in a certain dimension, and its members are being treated disadvantageously on the basis of this information.” Those signals are observable variables that correlate with productivity.
In the most common version of the model, employers use observable group membership as a proxy for unobservable skills, and rely on their beliefs about
productivity correlates, in particular their estimates of average productivity differences between groups, as in Phelps (1972), Arrow (1973), or Bielby and Baron (1986). A variant of this theory assumes no group differences in average productivity, but rather a belief that the variance in productivity is larger for some groups than for others, as in Aigner and Cain (1977) or Cornell and Welch (1996). In these cases, risk-averse employers facing imperfect information may discriminate against groups with larger expected variances in productivity. According to England (1994), “statistical discrimination” might explain why there is still discrimination in a competitive market. For Bertrand and Duflo (2017), “statistical discrimination” is a “more disciplined explanation” than the taste-based model initiated by Becker (1957), because the former “does not involve an ad hoc (even if intuitive) addition to the utility function (animus toward certain groups) to help rationalize a puzzling behavior.”
Here, “statistical discrimination,” rather than simply providing an explanation, can lead people to see social stereotypes as useful and acceptable, and therefore help to rationalize and justify discriminatory decisions. As suggested by Tilcsik (2021), economists have theorized labor market discrimination by constructing mathematical models that attribute discrimination to the deliberate actions of profit-maximizing firms or utility-maximizing individuals (as discussed in Charles and Guryan (2011) or Small and Pager (2020)). This view of discrimination has influenced social science debates, legal decisions, corporate practices, and public policy discussions, as mentioned in Ashenfelter and Oaxaca (1987), Dobbin (2001), Chassonnery-Zaïgouche (2020), or Rivera (2020). The most influential economic model of discrimination is probably “statistical discrimination theory,” discussed in the 1970s by Phelps (1972), Arrow (1973), and Aigner and Cain (1977). Applied to labor markets, this theory claims that employers have imperfect information about the future productivity of job applicants, which leads them to use easily observable signals, such as race or gender, to infer the expected productivity of applicants, as explained in Correll and Benard (2006). Employers who practice “statistical discrimination” rely on their beliefs about group statistics to evaluate individuals (corresponding to “discrimination” as defined in Definition 1.1). In this model, discrimination does not arise from a feeling of antipathy toward members of a group; it is seen as a rational solution to an information problem. Profit-maximizing employers use all the information available to them and, as individual-specific information is limited, they use group membership as a “proxy.” Economists tend to view “statistical discrimination” as “the optimal solution to an information extraction problem” and sometimes describe it as “efficient” or “fair,” as in Autor (2003), Norman (2003) and Bertrand and Duflo (2017). It should be stressed here that this approach, initiated in the 1970s in the context of labor economics, is essentially the same as the one underlying the concept of “actuarial fairness.” Observe finally that the word “statistical” used here reinforces the image of discrimination as a rational, calculated decision, even though several models do not assume that employers’ beliefs about group differences are based on statistical data, or any other type of systematic evidence. Employers’ beliefs might be based
on partial or idiosyncratic observations. As mentioned in Bohren et al. (2019), it is then possible to have “statistical discrimination with bad statistics.”
1.1.6 Algorithmic Injustice and Fairness of Predictive Models
Although economists published extensively on discrimination in the job market in the 1970s, the subject has come back into the spotlight following a number of publications linked to predictive algorithms. Correctional Offender Management Profiling for Alternative Sanctions, or compas, is a tool widely used as a decision aid in US courts to assess a criminal’s risk of re-offending, based on risk scales for general and violent recidivism, and for pretrial misconduct. After several months of investigation, Angwin et al. (2016) examined the output of compas in a series of articles called “Machine Bias” (subtitled “Investigating Algorithmic Injustice”).
As pointed out by Feller et al. (2016), if we look at the compas dataset (from the fairness R package), in Fig. 1.2, on the one hand (on the left of the figure):
• for Black people, among those who did not re-offend, 42% were classified as high risk
• for white people, among those who did not re-offend, 22% were classified as high risk
With standard terminology from classification and decision theory, the false-positive rate is about twice as high for Black people (42% against 22%). As Larson et al. (2016) wrote: “Black defendants were often predicted to be at a higher risk of
Fig. 1.2 Two analyses of the same descriptive statistics of the compas data, with the number of defendants as a function of (1) the race of the defendant (Black and white), (2) the risk category obtained from a classifier (binary: low and high), and (3) the indicator of whether the defendant re-offended or not. On the left-hand side, the analysis of Feller et al. (2016) and, on the right-hand side, that of Dieterich et al. (2016)
recidivism than they actually were.” On the other hand (on the right-hand side of the figure), as Dieterich et al. (2016) observed:
• For Black people, among those who were classified as high risk, 35% did not re-offend.
• For white people, among those who were classified as high risk, 40% did not re-offend.
Therefore, as the rate of recidivism is approximately equal at each risk score level, irrespective of race, it could be argued that the algorithm is not racist. The first approach is called “false-positive rate parity,” whereas the second one is called “predictive parity.” Obviously, there are reasonable arguments in favor of both contradictory positions. From this simple example, we see that agreeing on a single valid definition of “fairness” or “parity” will be complicated.
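To make the two criteria concrete, here is a minimal sketch in R on generic vectors (the toy labels below are made up for illustration; they are not taken from the compas data):

y    <- c(0, 0, 1, 1, 0, 1, 0, 1)     # 1 if the defendant re-offended (made up)
yhat <- c(1, 0, 1, 1, 0, 1, 1, 0)     # 1 if classified as high risk (made up)
s    <- c("B", "B", "B", "B", "w", "w", "w", "w")   # group labels (made up)
fpr <- function(g) mean(yhat[y == 0 & s == g])  # high risk among non-re-offenders
ppv <- function(g) mean(y[yhat == 1 & s == g])  # re-offenders among high risk
fpr("B"); fpr("w")   # "false-positive rate parity" compares these two numbers
ppv("B"); ppv("w")   # "predictive parity" compares these two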
Since then, many books and articles have addressed the issues highlighted in this series, namely the increasing power of these predictive decision-making tools, their ever-increasing opacity, the discrimination they replicate (or amplify), the “biased” data used to train or calibrate these algorithms, and the sense of unfairness they produce. For instance, Kirkpatrick (2017) pointed out that “the algorithm itself may not be biased, but the data used by predictive policing algorithms is colored by years of biased police practices.”
And justice is not the only area where such techniques are used. In the context of predictive health systems, Obermeyer et al. (2019) observed that a widely used health risk prediction tool (predicting how sick individuals are likely to be, and the associated health care cost), applied to roughly 200 million individuals in the USA per year, exhibited significant racial bias. More precisely, only 17.7% of the patients that the algorithm assigned to receive “extra care” were Black; if the bias in the system were corrected for, as reported by Ledford (2019), this percentage would increase to 46.5%. Those “correction” techniques will be discussed in Part IV of this book, when presenting “mitigation.”
Massive data and machine-learning techniques have provided an opportunity to revisit a topic that has been explored by lawyers, economists, philosophers, and statisticians for the past 50 years or longer. The aim here is to return to these ideas and shed new light on them, with a focus on insurance, and to explore possible solutions. Lawyers, in particular, have discussed these predictive models and this “actuarial justice,” as Thomas (2007), Harcourt (2011), Gautron and Dubourg (2015), or Rothschild-Elyassi et al. (2018) coined it.
The idea of bias and algorithmic discrimination is not a new one, as shown for instance by Pedreshi et al. (2008). However, over the past 20 years, the number of examples has continued to increase, with more and more interest from the media, with headlines such as “AI biases caused 80% of black mortgage applicants to be rejected” (Hale 2021), or “How the use of AI risks recreating the inequity of the insurance industry of the previous century” (Ito 2021). Pursuing the analysis of David (2015), McKinsey (2017) announced that artificial intelligence would disrupt the workplace (including the insurance and banking sectors, Mundubeltz-Gendron (2019)), particularly to
replace lackluster repetitive (human) work.3 These replacements raise questions, and compel the market and the regulator to be cautious. In an article on “ethical artificial intelligence,” Reijns et al. (2021) report that “the Dutch insurance sector makes it a mandate,” and, in France, Défenseur des droits (2020) recalls that “algorithmic biases must be able to be identified and then corrected” because “non-discrimination is not an option, but refers to a legal framework.” Bergstrom and West (2021) note that there are people writing a bill of rights for robots, or devising ways to protect humanity from super-intelligent, Terminator-like machines, whereas getting into the details of algorithmic auditing is often seen as boring, yet necessary.
Living with blinders on, or closing our eyes, rarely solves problems, although it has long been advocated as a solution to discrimination. As Budd et al. (2021) show, recalling an Amazon experiment of removing names from CVs to eliminate gender discrimination, this does not work: even with the candidates’ names hidden, the algorithm continued to preferentially choose men over women. Why did this happen? Simply because Amazon trained the algorithm on its existing resumes, with an over-representation of men, and there are elements of a resume (apart from the name) that can reveal a person’s gender, such as a degree from a women’s university, membership of a female professional organization, or a hobby where the sexes are disproportionately represented. Proxies that correlate more or less with the “protected” variables may sustain a form of discrimination.
In this textbook, we address these issues, limiting ourselves to actuarial models in an insurance context and, almost exclusively, to the pricing of insurance contracts. In Seligman (1983), the author asks the following basic question: “If young women have fewer car accidents than young men—which is the case—why shouldn’t women get a better rate? If industry experience shows—which it does—that women spend more time in hospital, why shouldn’t women pay more?” This type of question will be the starting point of our considerations in this textbook.
Paraphrasing Georges Clémenceau,4 who said (in 1887) that “war is too serious a thing to be left to the military,” Wortham (1985) argued that insurance segmentation was too important a task to be left to actuaries. Forty years later, we might wonder whether it is not worse to leave it to algorithms, and we should clarify actuaries’ role in these debates. In the remainder, we begin by reviewing insurance segmentation and the foundations of actuarial pricing of insurance contracts. We then review the various terms mentioned in the title, namely the notions of “bias,” “discrimination,” and “fairness,” while proposing a typology of predictive models and data (in particular, the so-called “sensitive” data, which may be linked to possible discrimination).
3 Even if this seems exaggerated since, on the contrary, it is often humans who perform repetitive tasks to help robots: “in most cases, the task is repetitive and mechanical. One worker explained that he once had to listen to recordings to find those containing the name of singer Taylor Swift in order to teach the algorithm that it is a person,” as reported by Radio Canada in April 2019.
4 Member of the Chamber of Deputies from 1885 to 1893, then Prime Minister of France from 1906 to 1909 and again from 1917 until 1920.
1.1.7 Discrimination Mitigation and Affirmative Action
Mitigating discrimination is often seen as paradoxical because, in order to avoid one discrimination, we must create another. More precisely, Supreme Court Justice Harry Blackmun stated, in 1978, “in order to get beyond racism, we must first take account of race. There is no other way. And in order to treat some persons equally, we must treat them differently” (cited in Knowlton (1978), as mentioned in Lippert-Rasmussen (2020)). More formally, an argument in favor of affirmative action, called “the present-oriented anti-discrimination argument,” is simply that justice requires that we eliminate or at least mitigate (present) discrimination by the best morally permissible means of doing so, which corresponds to affirmative action. Freeman (2007) suggested a “time-neutral anti-discrimination argument,” in order to mitigate past, present, or future discrimination. But there are also arguments against affirmative action, corresponding to “the reverse discrimination objection,” as defined in Goldman (1979): some might consider that there is an absolute ethical constraint against unfair discrimination (including affirmative action). To quote another Supreme Court Justice, Chief Justice John G. Roberts submitted in 2007: “The way to stop discrimination on the basis of race is to stop discriminating on the basis of race” (Turner (2015) and Sabbagh (2007)). The arguments against affirmative action are usually based on two theoretical moral claims, according to Pojman (1998). The first denies that groups have moral status (or at least meaningful status); according to this view, individuals are only responsible for the acts they perform as specific individuals and, as a corollary, we should only compensate individuals for the harms they have specifically suffered. The second asserts that a society should distribute its goods according to merit.
1.2 From Words and Concepts to Mathematical Formalism
1.2.1 Mathematical Formalism
The starting point of any statistical or actuarial model is to suppose that observations are realizations of random variables, in some probability space (Ω, F, P) (see Rolski et al. (2009), for example, or any actuarial textbook). Therefore, let P denote the “true” probability measure, associated with the random variables (Z, Y) = (S, X, Y). Here, the features Z can be split into a couple (S, X), where X is the non-sensitive information whereas S is the sensitive attribute.5 Y is the outcome we want to model, which could correspond to the annual loss of a given insurance policy (insurance pricing), the indicator of a false claim (fraud detection), the number of visits to
5 For simplicity, in most of the book, we discuss the case where S is a single sensitive attribute.
the dentist (partial information for insurance pricing), the occurrence of a natural
catastrophe (claims management), the indicator that the policyholder will purchase insurance from a competitor (churn model), etc. Thus, we have a triplet (S, X, Y), defined on S × X × Y, following some unknown distribution P. Classically, Dn = {(zi, yi)} = {(si, xi, yi)}, for i = 1, 2, ..., n, will denote a dataset, and Pn will denote the empirical probability measure associated with the sample Dn.
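In compact LaTeX notation (a plain restatement of the definitions just given, nothing new), the setup reads

\[
(S, X, Y) \sim \mathbb{P} \ \text{on}\ \mathcal{S}\times\mathcal{X}\times\mathcal{Y},
\qquad
\mathcal{D}_n = \{(s_i, x_i, y_i),\ i = 1, \ldots, n\},
\qquad
\mathbb{P}_n = \frac{1}{n}\sum_{i=1}^{n} \delta_{(s_i, x_i, y_i)},
\]

where \(\delta\) denotes a Dirac mass, so that \(\mathbb{P}_n\) is simply the empirical measure of the sample.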
It is always assumed in this book that S is somehow fixed in advance, and is not learnt: gender is considered as a binary categorical variable, sensitive and protected. In most cases, s will be a categorical variable and, to avoid heavy notation, we simply consider a binary sensitive attribute (denoted s ∈ {A, B} to remain quite general, and to avoid {0, 1}, so as not to get confused with the values taken by y in a classification problem). Recently, Hu et al. (2023b) discussed the case where s is a vector of multivariate attributes (possibly correlated, of course). Y depends on the model considered: in a classification problem, Y usually corresponds to {0, 1}, whereas in a regression problem, Y corresponds to the real line R. We can also consider counts, when y ∈ N (i.e., {0, 1, 2, ...}). We do not discuss here the case where y is a collection of multiple predictions (also coined “multiple tasks” in the
machine-learning literature, see for example Hu et al. (2023a) for applications in the
context of fairness). Throughout the book, we consider models that are formally functions m :
S × X → Y, estimated from our training dataset Dn. Considering models m : X → Y (sometimes coined “gender-blind” if s denotes the gender, or “color-blind” if s denotes the race, etc.) is supposed to create a more “fair” model, unfortunately in a very weak sense (as many variables in x might be strongly correlated with s). After estimating a model, we can use it to obtain predictions, denoted ŷ, while m(x) (or m(x, s)) will be called the “score” when y is a binary variable taking values in {0, 1}.
1.2.2 Legitimate Segmentation and Unfair Discrimination
In the previous section, we tried to explain that there could be “legitimate” and “illegitimate” discrimination, “fair” and “unfair.” We consider here a first attempt to illustrate that issue, with a very simple (simulated) dataset. Consider a risk, and let y denote the occurrence of that risk (hence, y is binary). As we discuss in Chap. 2, it is legitimate to ask policyholders to pay a premium that is proportional to P[Y = 1], the probability that the risk occurs (which will be the idea of “actuarial fairness”). Assume now that this occurrence is related to a single feature x: the larger x, the more likely the risk will occur. A classic example could be the occurrence of the death of a person, where x is the age of that person. Here, the correlation between y and x comes from a common (unobserved) factor, x0. In a small dataset, toydata1 (divided into a training dataset, toydata1_train, and a validation dataset, toydata1_valid), we have simulated values, where the confounding variable x0 (which will not be observed, and therefore cannot
be used in the modeling process) is a Gaussian variable, X0 ∼ N(0, 1), and then

X = X0 + ε,          ε ∼ N(0, 1/2²),
S = 1(X0 + η > 0),   η ∼ N(0, 1/2²),
Y = 1(X0 + ν > 0),   ν ∼ N(0, 1/2²).
The sensitive attribute s, which takes values 0 (or B) and 1 (or A), does not influence y, and therefore it might not be legitimate to use it (using it could be seen as “illegitimate discrimination”). Note that x0 influences all the variables x, s, and y (with a probit model for the last two) and, because of that unobserved confounding variable x0, all the variables are here (strongly) correlated. In Fig. 1.3, we can visualize the dependence between x and y (via boxplots of x given y) on the left-hand side,
Fig. 1.3 On top, boxplots of x conditional on y, with y ∈ {0, 1}, on the left-hand side, and conditional on s, with s ∈ {A, B}, on the right-hand side, from the toydata1 dataset. Below, the curve on the left-hand side is x → P[Y = 1|X = x], whereas the curve on the right-hand side is x → P[S = A|X = x]. Hence, when x = +1, P[Y = 1|X = x] ∼ 75%, and therefore P[Y = 0|X = x] ∼ 25% (on the left-hand side), whereas when x = +1, P[S = A|X = x] ∼ 95%, and therefore P[S = B|X = x] ∼ 5% (on the right-hand side)
and between x and s (via boxplots of x given s) on the right-hand side. For example, if x ∼ −1, then y takes the values 0 and 1 with, respectively, a 75% and 25% chance; it is a 25% and 75% chance if x ∼ +1. Similarly, when x ∼ −1, s is about four times more likely to be in group B than in group A.
When fitting a logistic regression to predict y based on both x and s, from toydata1_train, we observe that the variable x is clearly significant, but s is not (using glm in R; see Sect. 3.3 for more details about standard classifiers, starting with the logistic regression):
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.2983     0.2083  -1.432    0.152
x             1.0566     0.1564   6.756 1.41e-11 ***
s == A        0.2584     0.2804   0.922    0.357
Without the sensitive variable s, we obtain a logistic regression on x only, which could be seen as “fair through unawareness.” The estimation yields
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.1390     0.1147  -1.212    0.226
x             1.1344     0.1333   8.507   <2e-16 ***
Here, m(x), which estimates E[Y|X = x], is equal to

m(x) = exp[−0.1390 + 1.1344 x] / (1 + exp[−0.1390 + 1.1344 x]).
But this does not mean that the model is perceived as “fair” by everyone. In Fig. 1.4, we can visualize the probability that the score m exceeds a given threshold t, here 50%. Even without using s as a feature in the model, P[m(X) > t|S = s] does depend on s, whatever the threshold t. And while E[m(X)] ∼ 50%, observe that E[m(X)|S = A] ∼ 65% whereas E[m(X)|S = B] ∼ 25%. With our premium interpretation, this means that, on average, people belonging to group A pay a premium at least twice that paid by people in group B. Of course, ceteris paribus, this is not the case, as individuals with the same x have the same prediction, whatever s; but overall, we observe a clear difference. One can easily transfer this simple example to many real-life applications.
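The whole example can be reproduced in a few lines of R; the following is a minimal sketch (simulating data as described above rather than loading toydata1, so the exact figures will differ slightly from those reported here):

set.seed(123)
n  <- 600
x0 <- rnorm(n)                                       # unobserved confounding factor
x  <- x0 + rnorm(n, sd = 1/2)                        # observed feature
s  <- ifelse(x0 + rnorm(n, sd = 1/2) > 0, "A", "B")  # sensitive attribute
y  <- as.numeric(x0 + rnorm(n, sd = 1/2) > 0)        # outcome
fit_aware   <- glm(y ~ x + s, family = binomial)     # uses the sensitive variable
fit_unaware <- glm(y ~ x,     family = binomial)     # "fairness through unawareness"
m <- predict(fit_unaware, type = "response")         # scores m(x)
tapply(m, s, mean)   # average scores per group differ, even though s is not used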
Throughout this book, we provide examples of such situations, then formalize some measures of fairness, and finally discuss methods used to mitigate possible discrimination in a predictive model m, even if m is not a function of the sensitive attribute (fairness through unawareness).
Fig. 1.4 Distribution of the score m(X, S), conditional on A and B, on the left-hand side, and distribution of the score m(X), without the sensitive variable, conditional on A and B, on the right-hand side (fictitious example). In both cases, logistic regressions are considered. From this score, we can get a classifier ŷ = 1(m(z) > t) (where z is either (x, s), on the left-hand side, or simply x, on the right-hand side). Here, we consider the cut-off t = 50%. Areas to the right of the vertical line (at t = 50%) correspond to the proportion of individuals classified as ŷ = 1, in both groups, A and B
1.3 Structure of the Book
In Part I, we get back to insurance and predictive modeling. In Chap. 2, we present applications of predictive modeling in insurance, emphasizing insurance ratemaking and premium calculations, first in the context of homogeneous policyholders, and then in that of heterogeneous policyholders. We discuss “segmentation” from a general perspective, the statistical approach being presented in Chap. 3. In that chapter, we present standard supervised models, with generalized linear models (GLMs), penalized versions, neural nets, trees, and ensemble approaches. In Chap. 4, we then address the questions of interpretation and explanation of predictive models, as well as accuracy and calibration.
In Part II, we further discuss segmentation, discrimination, and sensitive attributes in the context of insurance modeling. In Chap. 5, we provide a classification and a typology of pricing variables. In Chap. 6, we discuss direct discrimination (with race, gender, age, and genetics), and indirect discrimination. We turn to biases and data in Chap. 7, with a discussion about observations and experiments. We examine how data are collected before getting back to the popular adage “correlation is not causation,” and start to discuss causal inference and counterfactuals.
In Part III, we present various approaches to quantify fairness, with a focus on “group discrimination” concepts in Chap. 8, whereas “individual fairness” is presented in Chap. 9.
And finally, in Part IV, we discuss the mitigation of discrimination, using three approaches: the pre-processing approach in Chap. 10, the in-processing approach in Chap. 11, and the post-processing approach in Chap. 12.
1.4 Datasets and Case Studies
In the following chapters, and more specifically in Parts III and IV, we use both generated data and publicly available real datasets to illustrate various techniques, either to quantify a potential discrimination (in Part III) or to mitigate it (in Part IV). All the datasets are available from the GitHub repository,6 in R.
> library(devtools)
> devtools::install_github("freakonometrics/InsurFair")
> library(InsurFair)
The first toy dataset is the one discussed previously in Sect. 1.2.2, with toydata1_train and toydata1_valid, with (only) three variables: y (binary outcome), s (binary sensitive attribute), and x (drawn from a Gaussian variable).
> str(toydata1_train)
'data.frame': 600 obs. of 3 variables:
$ x : num 0.7939 0.5735 0.9569 0.1299 -0.0606 ...
$ s : Factor w/ 2 levels "B","A": 1 1 2 1 2 2 2 1 1 1 ...
$ y : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 1 2 1 ...
As discussed, the three variables are correlated, as they are all based on the unobserved common variable x0.
The toydata2 dataset consists of two generated samples: n = 5000 observations are used as a training sample, and n = 1000 for validation. The process used to generate the data is the following (see also the sketch after this list):
• The binary sensitive attribute, s ∈ {A, B}, is drawn, with respectively 60% and 40% of individuals in each group
• (x1, x3) ∼ N(μs, Σs), with a correlation of 0.4 when s = A and 0.7 when s = B
• x2 ∼ U([0, 10]), independent of x1 and x3
• η = β0 + β1 x1 + β2 x2 + β3 x1² + β4 1(s = B), which does not depend on x3
• y ∼ B(p), where p = exp(η)/[1 + exp(η)] = μ(x1, x2, s).
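As an illustration, here is a hedged sketch of this generating process in R; the beta coefficients, and the choice of standard group means and variances for (x1, x3), are illustrative simplifications, not the values actually used to build toydata2:

set.seed(42)
n    <- 5000
s    <- sample(c("A", "B"), n, replace = TRUE, prob = c(0.6, 0.4))
rho  <- ifelse(s == "A", 0.4, 0.7)               # group-specific correlation
x1   <- rnorm(n)
x3   <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)    # correlated with x1, given s
x2   <- runif(n, 0, 10)                          # independent of (x1, x3)
beta <- c(-2, 0.5, 0.3, 0.1, 0.5)                # illustrative (beta0, ..., beta4)
eta  <- beta[1] + beta[2]*x1 + beta[3]*x2 + beta[4]*x1^2 + beta[5]*(s == "B")
p    <- exp(eta) / (1 + exp(eta))                # mu(x1, x2, s)
y    <- rbinom(n, 1, p)                          # note: x3 has no direct effect on y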
In Fig. 1.5, we can visualize scatter plots with x1 on the x-axis and x2 on the y-axis, with colors depending on y (y ∈ {GOOD, BAD}, or y ∈ {0, 1}) on the left-hand side, and on s (s ∈ {A, B}) on the right-hand side. In Fig. 1.6, we can visualize the level curves of (x1, x2) → μ(x1, x2, A) on the left-hand side and of (x1, x2) → μ(x1, x2, B) on the right-hand side, where μ(x1, x2, s) denotes the true probabilities used to generate the dataset. Colors reflect the value of the probability and are coherent with {GOOD, BAD}.
6 See Charpentier (2014) for a general overview of the use of R in actuarial science. Note that some packages mentioned here also exist in Python (in scikit-learn), as well as packages dedicated to fairness, such as fairlearn or aif360.
Fig. 1.5 Scatter plots of toydata2, with x1 on the x-axis and x2 on the y-axis; colors depend on the outcome y (y ∈ {GOOD, BAD}, or y ∈ {0, 1}) on the left-hand side, and on the sensitive attribute s (s ∈ {A, B}) on the right-hand side
[Fig. 1.6: two panels of level curves, “True model, sensitive = A” (left) and “True model, sensitive = B” (right), with x1 on the x-axis, x2 on the y-axis, and a probability scale graded from 0 to 100%.]
Fig. 1.6 Level curves of (x1, x2) → μ(x1, x2, A) on the left-hand side and of (x1, x2) → μ(x1, x2, B) on the right-hand side, the true probabilities used to generate the toydata2 dataset. The blue area in the lower-left corner corresponds to y close to 0 (GOOD risk, in blue), whereas the red area in the upper-right corner corresponds to y close to 1 (BAD risk, in red)
Then, there will be real data. The GermanCredit dataset, collected in Hofmann (1990) and used in the CASdatasets package, from Charpentier (2014), contains 1000 observations and 23 attributes. The variable of interest y is a binary variable indicating whether a person experienced a payment default: there are 70% of 0’s (“good” risks) and 30% of 1’s (“bad” risks). The sensitive attribute is the gender of the person (binary, with 69% women (B) and 31% men (A)), but we can also use the age, treated as categorical.
The FrenchMotor dataset, from Charpentier (2014), is in personal motor insurance, with underwriting data and information about claim occurrence (here considered as binary). It is obtained as the aggregation of freMPL1, freMPL2,
freMPL3, and freMPL4 from the CASdatasets R package, keeping only observations with exposure exceeding 90%. Here, the sensitive attribute is s = Gender, which is a binary feature, and the goal is to create a score that reflects the probability of claiming a loss (during the year). The entire dataset contains n = 12,437 policyholders and 18 variables. A subset with 70% of the observations is used for training, and the remaining 30% for validation (a possible split is sketched below). Note that the variable SocioCateg contains here nine categories (only the first digit of the categories is considered). In numerical applications, two specific individuals (named Andrew and Barbara) are considered, to illustrate various points.
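For instance, a minimal sketch of such a 70/30 split in R (assuming the FrenchMotor data frame has been loaded from the InsurFair package):

set.seed(1)
idx <- sample(nrow(FrenchMotor), size = round(0.7 * nrow(FrenchMotor)))
FrenchMotor_train <- FrenchMotor[idx, ]    # 70% of observations, for training
FrenchMotor_valid <- FrenchMotor[-idx, ]   # remaining 30%, for validation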
The telematic dataset is an original dataset, containing 1177 insurance contracts observed over 2 years. We have claims data for 2019 (here claim is binary, no or yes; 13% of the policyholders claimed a loss), the age (age) and the gender (gender) of the driver, as well as some telematic data for 2018 (including Total_Distance and Total_Time, as well as Drive_Score, Style_Score, Corner_Score, Acceleration_Score, and Braking_Score, in addition to some binary scores related to “heavy” acceleration or braking).
Part I
Insurance and Predictive Modeling
Predictive modeling involves the use of data to forecast future events. It relies on capturing relationships between explanatory variables and the predicted variables from past occurrences and exploiting these relationships to predict future outcomes. Forecasting future financial events is a core actuarial skill—actuaries routinely apply predictive modeling techniques in insurance and other risk management applications. (Frees et al. 2014a)

The sciences do not try to explain, they hardly even try to interpret, they mainly make models. By a model is meant a mathematical construct which, with the addition of certain verbal interpretations, describes observed phenomena. The justification of such a mathematical construct is solely and precisely that it is expected to work—that is, correctly to describe phenomena from a reasonably wide area. (Von Neumann 1955)

In economic theory, as in Harry Potter, the Emperor’s New Clothes or the tales of King Solomon, we amuse ourselves in imaginary worlds. Economic theory spins tales and calls them models. An economic model is also somewhere between fantasy and reality. Models can be denounced for being simplistic and unrealistic, but modeling is essential because it is the only method we have of clarifying concepts, evaluating assumptions, verifying conclusions and acquiring insights that will serve us when we return from the model to real life. In modern economics, the tales are expressed formally: words are represented by letters. Economic concepts are housed within mathematical structures. (Rubinstein 2012)
Chapter 2
Fundamentals of Actuarial Pricing
Abstract “Insurance is the contribution of the few to the misfortune of the many” is a simple way to describe what insurance is. But it doesn’t say what the “contribution” should be, to be fair. In this chapter, we return to the fundamentals of pricing and risk sharing, and at the end we mention other models used in insurance (to predict future payments to be provisioned, to create a fraud score, etc.).
Even though insurers are not able to predict which of their clients will suffer a loss, they should be capable of estimating the probability that a policyholder claims a loss, and possibly the distribution of their aggregate losses, with an acceptable margin of error, and of budgeting accordingly. The role of actuaries is to run statistical analyses to measure individual risks and price them.
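As a quick numerical illustration (a hedged sketch with made-up figures, not one of the book’s datasets): with independent risks, the average loss in a pool becomes predictable as the pool grows, which is precisely what makes such budgeting possible.

set.seed(1)
p <- 0.10; cost <- 1000                  # 10% claim probability, fixed claim cost
avg_loss <- function(n) mean(rbinom(n, 1, p) * cost)   # average loss per head
sd(replicate(5000, avg_loss(100)))       # about 30: small pool, uncertain budget
sd(replicate(5000, avg_loss(10000)))     # about 3: large pool, predictable budget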
2.1 Insurance
The insurance business is characterized by an inverted production cycle: in return for a premium, the amount of which is known when the contract is taken out, the insurer undertakes to cover a risk whose date and amount are unknown, according to the definition of “actuarial pricing.” In order to do this, the insurer pools the risks within a mutuality. The universal secret of insurance is therefore the pooling of a large number of insurance contracts within a mutuality, in order to allow compensation between the risks that have suffered losses and those for which the insurer has collected premiums without having had to pay out any benefits, as Petauton (1998) argues. To use Chaufton (1886)’s formulation, insurance is the “compensation of the effects of chance by mutuality organised according to the laws of statistics.” The first important concept is “mutualization.”
Definition 2.1 (Mutuality (Wilkie 1997)) Mutuality is considered to be the normal form of commercial private insurance, where participants contribute to the risk pool through a premium that relates to their particular risk at the time of
the application, i.e., the higher the risk that they bring to the pool, the higher the premium required.
Through effective underwriting, Wilkie (1997) claims that “the risk is evaluated by the insurer as thoroughly as possible, based on all the facts that are relevant and available.” Participation in mutual insurance schemes is voluntary, and the amount of cover that the individual purchases is discretionary. An essential feature of mutual insurance is segmentation, or discrimination in underwriting, leading to significant differences in premium rates for the same amount of life cover for different participants; Viswanathan (2006) gives several examples. The second concept is “solidarity.”
Definition 2.2 (Solidarity (Wilkie 1997)) Solidarity is the basis of most national or social insurance schemes. Participation in such state-run schemes is generally compulsory and individuals have no discretion over their level of cover. All participants normally have the same level of cover. In solidarity schemes the contributions are not based on the expected risk of each participant.
In those state-run schemes, contributions are often simply equal for all, or set according to the individual’s ability to pay (such as a percentage of income). As everybody pays the same contribution rate, low-risk participants effectively subsidize high-risk participants. From an insurance economics perspective, agents make decisions individually, forgetting that the decisions they make often go beyond their narrow self-interest, reflecting instead broader community and social interests, even in situations where agents are not known to each other. This is not altruism, per se, but rather a notion of strong reciprocity, the “predisposition to cooperate even when there is no apparent benefit in doing so,” as formalized in Gintis (2000) and Bowles and Gintis (2004).
Solidarity is important in insurance. In most countries, employer-based health insurance includes maternity benefits for everyone. In the USA, a federal law says it is discriminatory not to do so (the “Pregnancy Discrimination Act” (PDA) is an amendment to the Civil Rights Act of 1964, enacted in 1978). “Yes, men should pay for pregnancy coverage, and here’s why,” said Hiltzik (2013): “it takes two to tango.” No man has ever given birth to a baby, but it is also true that no baby has ever been born without a man being involved somewhere along the line. “Society has a vested interest in healthy babies and mothers” and “universal coverage is the only way to make maternity coverage affordable”; therefore, solidarity is imposed, and men should pay for pregnancy coverage.
One should probably stress here that insurance is not used to eliminate risk, but to transfer it, and this transfer is done according to a social philosophy chosen by the insurer. With “public insurance,” as Ewald (1986) reminds us, the goal is to transfer risk from individuals to a wider social group, by “socialising,” or redistributing, risk “more fairly within the population.” Thus, low-risk individuals pay insurance premiums at a higher rate than their risk profile would suggest, even if this seems “inefficient” from an economic point of view. Social insurance is organized according to principles of solidarity, where access and coverage are
independent of risk status, and sometimes of ability to pay (as noted by Mittra (2007)). Nevertheless, in many cases, the premium is proportional to the income of the policyholder, and such insurance is usually provided by public rather than private entities. For some social goods, such as health care, long-term care, and perhaps even basic mortgage life insurance, it may simply be inappropriate to provide such products through a mutuality-based model that inevitably excludes some individuals, as “primary social goods, because they are defined as something to which everyone has an inalienable right, cannot be distributed through a system that excludes individuals based on their risk status or ability to pay.” Mutual insurance companies are often seen as an intermediary between such public insurance and for-profit insurance companies.
And as Lasry (2015) points out, “insurance has long been faced with a dilemma: on the one hand, better knowledge of a risk allows for better pricing; better knowledge of risk factors can also encourage prevention; on the other hand, mutualization, which is the basis of insurance, can only subsist in most cases in a situation of relative ignorance (or even a legal obligation of ignorance).” Actuaries will then seek to classify or segment risks, all based on the idea of mutualization. We shall return to the mathematical formalism of this dilemma. De Pril and Dhaene (1996) point out that segmentation is a technique that the insurer uses to differentiate the premium, and possibly the cover, according to a certain number of specific risk characteristics of the policyholder (hereinafter referred to as segmentation criteria), with the aim of achieving a better match between, on the one hand, the estimated cost of the claims that a given person places on the community of policyholders and, on the other hand, the premium that this person has to pay for the cover offered. In Box 2.1, Rodolphe Bigot reviews general legal considerations regarding risk selection in insurance (in France, but most principles can be observed elsewhere). Underwriting is the term used to describe the decision-making process by which insurers determine whether to offer, or refuse, an insurance policy to an individual, based on the available information (and the requested amount). Gandy (2016) asserts that the “right to underwrite” is basically a right to discriminate. Hence, a “higher premium” corresponds to a rating decision, an “exclusion waiver” is a coverage decision, whereas a “denial” is an underwriting decision.
Box 2.1 Insurance & Underwriting (in French Law), by Rodolphe Bigot1 The insurance transaction and the underlying mutualization are based on so-called risk selection. Apart from most group insurance policies, which consist of a kind of mutualization within a mutualization, the insurer refuses or accepts that each applicant for insurance enters the mutualization constituted
1 Lecturer in private law, UFR of Law, Le Mans University, member of Thémis-UM and Ceprisca.
by the group of policyholders. This selection of risks “confines the mutualization to the policyholders accepted by the insurer, who is always considered, in insurance contract law, as the one who accepts the contract proposed to him by the applicant for insurance,” wrote Monnet (2017, p. 13ff). In this respect, it should be recalled that the economics of the insurance transaction requires that the insurance company be given a great deal of freedom to accept or refuse the risk proposed to it. Proof of this freedom in the area of personal insurance is in the provisions of Article 225-3 of the Criminal Code, which exclude from the scope of application of the criminal repression of discrimination in the supply of goods and services, provided for in Article 225-2, “discrimination based on the state of health, when it consists of operations whose purpose is the prevention and coverage of risks of death, risks of harm to the physical integrity of the person, or risks of incapacity for work or disability.” However, not to admit a limit to this freedom of the insurer would lead to the evacuation of important social considerations and to the exclusion not only from insurance but also from the goods and services linked to it (such as borrowing, and therefore access to property) of the most exposed persons (Bigot and Cayol 2020, p. 540). The question of the right to insurance arises here (Pichard 2006). To this end, “having access to insurance means not only the very possibility of taking out a contract for coverage, but perhaps also at a reasonable economic cost, not prohibitive, not dissuasive. In societies where the need for security, or even comfort, is a leitmotif, the question is very relevant.” (Noguéro 2010, p. 633)
2.2 Premiums and Benefits
Comparing policyholders is always tricky: not only do they potentially carry different risks, but they may also have different preferences (and therefore choose different policies). First of all, it is important to distinguish between types of cover. In a car insurance policy, the “third-party liability” cover is the compulsory component, covering exclusively the damage that the insured car might cause to a third party. But some policyholders may want (or need) more extensive protection. Other standard types of cover include “comprehensive” cover, which covers all damage to the vehicle (regardless of the circumstances of the accident or the driver’s responsibility); “collision” cover, which reimburses the owner for damage caused to the vehicle in the event of a collision with a third party; and “fire and theft” cover, which compensates the owner of the vehicle if it is damaged or destroyed by fire, or if it is stolen. Some insurers also offer “mechanical breakdown” cover, which allows the insurance to compensate for the cost of repairs related to a
|
||
|
||
2.2 Premiums and Benefits
|
||
|
||
29
|
||
|
||
Fig. 2.1 Coverage selected by auto insurance policyholders based on age, with the basic mandatory coverage, “third party insurance” and the broadest coverage known as “fully comprehensive.” (Source: personal communication, real data from an insurance company in France)
|
||
breakdown, or “vehicle contents” cover, which offers compensation in the event of damage to or disappearance of items inside the insured vehicle. There may also be “assistance” cover, which provides services in the event of a breakdown, such as breakdown assistance, towing, repatriation, etc. Another possible source of difference is the indemnity, which may vary according to the choice of the deductible level (Buchanan and Priest 2006). As a reminder, the deductible is the amount that remains payable by the policyholder after the insurer has compensated for a loss. The absolute (or fixed) excess is the most common in car insurance: in a policy with an excess of e150, if the repair costs amount to e250, the insurance company will pay e100 and the remaining e150 will be paid by the policyholder. Many insurers now offer “mileage excesses,” defining a perimeter around the vehicle’s usual parking place: within this perimeter, the assistance guarantee will not work. However, if a breakdown occurs outside this perimeter, the assistance guarantee can be called upon.
Also, it is difficult to compare the auto insurance premiums paid by different people. In Fig. 2.1, we can see that the choice of auto insurance coverage is strongly dependent on age, with young drivers opting overwhelmingly for the compulsory coverage only (the choice of one-third of drivers between 20 and 25 years of age), and older drivers taking out more "comprehensive" insurance (the choice of 90% of drivers between 70 and 80 years of age). Choosing more extensive coverage inevitably translates into higher bills, so older people may have a more expensive policy simply because they require more coverage.
As mentioned earlier, a natural idea is that each policyholder should be offered a premium that is proportional to the risk he or she represents, to avoid another company enticing this customer with a more attractive contract. This principle of "personalization" could be seen to have the virtues of fairness (as each individual pays according to the risk he or she passes on to the community) and can even be reconciled with the principle of mutualization: all that is needed (provided that the market is large enough) is to group individuals into mutuals that are homogeneous from the point of view of risk. This very general principle does not say anything about how to build a fair tariff. A difficult task lies in the fact that insurers have incomplete information about their customers. It is well known that the observable characteristics of policyholders (which can be used in pricing) explain only a small proportion of the risks they represent. The only remedy for this imperfection is to induce self-selection among policyholders by differentiating the cover offered to them, i.e., a nonlinear scale linking the premium to be paid to the amount of the deductible accepted. As mentioned in the previous chapter, observe that there is a close analogy between this concept of "fair tariff" and "actuarial fairness," or that of "equilibrium with signal" proposed by Spence (1974, 1976) to describe the functioning of certain labor markets. Riley (1975) proposed a more general model that could be applied to insurance markets, among others. Cresta and Laffont (1982) proved the existence of fair insurance rates for a single risk. Although the structure of equilibrium with signal is now well understood in the case of a one-dimensional parameter, the same cannot be said for cases where several parameters are involved. Kohlleppel (1983) gave an example of the non-existence of such an equilibrium in a model satisfying the natural extension of Spence's hypotheses.

As insurance is generally a highly competitive and regulated market, the insurer must use all the statistical tools and data at its disposal to build the best possible rates. At the same time, its premiums must be aligned with the company's strategy and take competition into account. Because of the important role played by insurance in society, premiums are also scrutinized by regulators. They must be transparent, explainable, and ethical. Thus, pricing is not only statistical; it also carries strategic and societal issues. These different issues can push the insurer to offer fairer premiums with respect to a given variable. For example, regulations may require insurers to present fair premiums with respect to the gender of the policyholder, and some insurers may decide, given their strategies, to offer fair premiums with respect to age. Regardless of the reason why an insurance player must present fairer pricing with respect to a variable, it must be able to define, measure, and then mitigate the ethical bias of its pricing, while preserving its consistency and performance.
2.3 Premium and Fair Technical Price
Definition 2.3 (Expected Value) Let Y be a discrete random variable; then

  E[Y] = Σ_{y∈𝒴} y · P[Y = y],

whereas if it is (absolutely) continuous with density f,

  E[Y] = ∫_𝒴 y f(y) dy.
See Feller (1957) or Denuit and Charpentier (2004) for more details about this quantity, which exists only if the sum, or the integral, is finite. Risks with infinite expected values exhibit unexpected properties. If this quantity exists, then because of the law of large numbers (Proposition 3.1), it corresponds to the probabilistic counterpart of the average of n values y₁, ..., yₙ obtained as independent draws of the random variable Y. Interestingly, as discussed in the next chapter, this quantity can be obtained as the solution of an optimization problem. More precisely,

  ȳ = argmin_{m∈ℝ} Σ_{i=1}^{n} (yᵢ − m)² and E[Y] = argmin_{m∈ℝ} E[(Y − m)²].
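This characterization is easy to check numerically. Here is a minimal R sketch, on simulated values that are purely illustrative (the Gamma parameters are arbitrary):

    # minimal sketch: the empirical mean minimizes the sum of squared deviations
    set.seed(1)
    y <- rgamma(1000, shape = 2, scale = 500)  # simulated annual losses
    sse <- function(m) sum((y - m)^2)          # criterion to minimize in m
    opt <- optimize(sse, interval = c(0, 10000))
    c(argmin = opt$minimum, mean = mean(y))    # the two values coincide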
2.3.1 Case of a Homogeneous Population
Before talking about segmentation, let us examine the case of a homogeneous portfolio, where the policyholders are faced with the same probability of occurrence of a claim, the same extent of damage, the same sum insured, etc. ("homogeneous" comes from ὁμογενής, homogenes, "of the same race, family or kind," from ὁμός, homos, "same," and γένος, genos [2]). In nonlife insurance, the insurer undertakes (in exchange for the payment of a premium, the amount of which is decided when the contract is signed) to cover the claims that will occur during the year. If y is the annual cost of a randomly selected policyholder, then we define [3] the "pure premium" as E[Y].
Definition 2.4 (Pure Premium (Homogeneous Risks)) Let Y be the non-negative random variable corresponding to the total annual loss associated with a given policy; then the pure premium is E[Y].
If we consider the risk of losing 100 with probability p (and nothing with probability 1 − p), the pure premium for this risk (economists would call it a lottery) is 100p. The premium for an insurance contract is then proportional to the probability of having a claim. In the following, our examples are often limited to the estimation of this probability. In a personal insurance contract, the period of cover is longer, and it is necessary to discount future payments [4],

  a = E[ 100 / (1+r)^T ] = Σ_{t=0}^{∞} 100/(1+r)^t · P[T = t],

for some discount rate r. However, this assumption of homogeneous risks proves to be overly simplistic in numerous insurance scenarios. Take death insurance, for example, where the law of T, representing the time of death, should ideally be influenced by factors such as the policyholder's age at the time of contract. This specific aspect will be explored further in Sect. 2.3.3. But first, let us return to classical concepts about economic decisions when facing uncertain events.

[2] γένος is the etymological source of "gender," and not "gene," based on γενεά, from the aorist infinitive of γίγνομαι, "I come into being."
[3] Denuit and Charpentier (2004) discuss the mathematical formalism that allows such a writing. In particular, the expectation is calculated according to a probability P corresponding to the "historical" probability (there is no law of one price in insurance, contrary to the typical approach in market finance, in the sense of Froot et al. (1995); this will be discussed further in Sect. 3.1.1). Insurance, like econometrics, is formalized in a probabilistic world, perhaps unlike many machine learning algorithms, which are derived without a probabilistic model, as discussed in Charpentier et al. (2018).
[4] Without discounting, as death is (at an infinite time horizon) certain, the pure premium would be exactly the amount of the capital paid to the beneficiaries.
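As a small numerical sketch of the formula above in R, with an assumed geometric distribution for T (chosen for illustration only) and an arbitrary discount rate:

    q <- 0.05                        # assumed constant annual probability of death
    t <- 0:200                       # truncation of the infinite sum
    p_T <- (1 - q)^t * q             # P[T = t], geometric distribution (assumption)
    r <- 0.03                        # discount rate
    sum(100 / (1 + r)^t * p_T)       # discounted pure premium a
    sum(100 * p_T)                   # without discounting, close to 100 (footnote [4])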
2.3.2 The Fear of Moral Hazard and Adverse-Selection
In the context of insurance, moral hazard refers to the impact of insurance on incentives to reduce risks. An individual facing an accidental risk, such as the loss of a home or a car, or the risk of medical expenses, can generally take actions to reduce the risk. Without insurance, the costs and benefits of accident avoidance, or precaution, are internal to the individual, and the incentives for avoidance are optimal. With insurance, some of the accident costs are borne by the insurer, as recalled in Winter (2000).
Definition 2.5 (Adverse Selection (Laffont and Martimort 2002)) Adverse selection is a market situation where buyers and sellers have different information. “Adverse selection” characterizes principal-agent models in which an agent has private information before a contract is written.
Definition 2.6 (Moral Hazard (Arrow 1963)) In economics, a moral hazard is a situation where an economic actor has an incentive to increase its exposure to risk because it does not bear the full costs of that risk.
There have been many publications about adverse selection and moral hazard in life insurance: adverse selection creates a demand for insurance that correlates positively with the insured person's risk of loss, which could be seen as immoral, or unethical. In Box 2.2, one of the oldest discussions of moral hazard, from Michelbacher (1926), is reproduced.
Box 2.2 Moral Hazard, Michelbacher (1926) “Moral hazard is the Bogey Man who will catch the unwary insurance official who does not watch out. When insurance is under consideration he is always present in one guise or another, sometimes standing out in bold relief, but more often lurking in the background where he employs every expedient to avoid detection. In all the ramifications of insurance procedure, from the binding of the risk until the last moment of policy coverage has expired, his insidious influence may manifest itself, usually where it is least expected. In the other case his ignorance, carelessness, inattention or recklessness may involve the carrier in claims which the ordinarily prudent policyholder would avoid. The unsafe automobile driver; the employer whose attitude toward safety is not proper; the careless person who loves display and is notoriously lax in the protection of his jewelry: these and many others are “bad risks” for the insurance carrier because they prevent the proper functioning of the law of averages and introduce the certainty of loss into the insurance transaction. It will be noted that the term “moral hazard” as employed in this discussion is used in a much broader sense than the following definition, which is typical of common usage, would imply: “The hazard is the deflection or variation from the accepted standard of what is right in one’s conduct. Moral Hazard is that risk or chance due to the failure of the positive moral qualities of a person whose interests are affected by a policy of insurance.”
All actuaries have grown up with Akerlof's 1970 fable of the "lemons." The insurance market is characterized by information asymmetries. From the insurer's point of view, these asymmetries mainly concern the need to find adequate information on the customer's risk profile. A decisive factor in the success of an insurance business model is the insurer's ability to estimate the cost of risk as accurately as possible. Although in the case of some simple product lines, such as motor insurance, the estimation of the cost of risk can be largely or fully automated and managed in-house, in areas with complex risks, the assistance of an expert third party can mitigate this type of information asymmetry. In Akerlof's terminology, some insurance buyers are considered low-risk peaches, whereas others are high-risk lemons. In some cases, insurance buyers know (to some extent) whether they are lemons or peaches. If the insurance company could tell the difference between lemons and peaches, it would have to charge peaches a premium related to the risk of the peaches, and lemons a premium related to the risk of the lemons, according to a concept of actuarial fairness, as Baker (2011) reminds us. But if actuaries are not able to differentiate between lemons and peaches, then they will have to charge the same price for an insurance contract. The main difference between the market described by Akerlof (1970) (in the original fable, it was a market for used cars) and an insurance market is that the information asymmetry was initially (in the car example) in favor of the seller of an asset. In the field of insurance, the situation is often more complex. In the field of car insurance, Dalziel and Job (1997) pointed out the optimism bias of most drivers, who all think they are "good risks." The same bias is found in many other examples, as mentioned by Royal and Walls (2019), except in health insurance, where the policyholder may indeed have more information than the insurer.
To use the description given by Chassagnon (1996), let us suppose that an insurer covers a large number of agents who are heterogeneous in their probability of suffering a loss. If the insurer proposes a single price that reflects the average probability of loss of the representative agent of this economy, it becomes unattractive for agents whose probability of suffering an accident is low to insure themselves. A phenomenon of selection by price therefore occurs, and it is said to be adverse because it is the bad agents who remain. To guard against this phenomenon of anti-selection, risk selection and premium segmentation are necessary. "Adverse selection disappears when risk analysis becomes sufficiently effective for markets to be segmented efficiently," says Picard (2003): "doesn't the difficulty econometricians have in highlighting real anti-selection situations in the car insurance market reflect the increasingly precise evaluation of the risks underwritten by insurers?"
2.3.3 Case of a Heterogeneous Population
It is important to have models that can capture this heterogeneity (from ἑτερογενής, heterogenes, "of different kinds," from ἕτερος, heteros, "other, another, different," and γένος, genos, "kind"). To get back to our introductory example, if T_x is the (random) age at death of the policyholder of age x at the time the contract was taken out (so that T_x − x is the residual life span), then the pure premium corresponds to the expected present value of future flows, i.e.,

  a_x = E[ 100 / (1+r)^(T_x − x) ] = Σ_{t=0}^{∞} 100/(1+r)^t · P[T_x = x + t],

for a discount rate r. Using a more statistical terminology, it can be rewritten as

  a_x = Σ_{t=1}^{∞} 100/(1+r)^t · (L_{x+t−1} − L_{x+t}) / L_x,

where L_t is the number of people alive at the end of t years in a cohort that we would follow, so that L_{x+t−1} − L_{x+t} is the number of people alive at the end of x+t−1 years but not x+t years (and therefore dead in their t-th year after subscription). It is De Witt (1671) who first proposed this premium for a life insurance, where discriminating according to age seems legitimate.
But we can go further, because P[T_x = x + t], the probability that the policyholder of age x at the time of subscription will die in t years, could also depend on his or her gender, his or her health history, and probably on other variables that the insurer might know. And in this case, it is appropriate to calculate conditional probabilities, P[T_x = x + t | woman] or P[T_x = x + t | man, smoker].
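A minimal R sketch of the life-table formula above, on a small hypothetical cohort (the L_x values below are invented for the illustration, and do not come from any regulatory table):

    # hypothetical numbers of survivors L_x at ages 0, 1, ..., 14
    L <- c(1000, 995, 990, 980, 960, 925, 870, 790, 680, 540,
           380, 220, 95, 25, 0)
    x <- 5                                     # age at subscription
    r <- 0.03                                  # discount rate
    t <- 1:(length(L) - 1 - x)                 # possible remaining years
    p <- (L[x + t] - L[x + t + 1]) / L[x + 1]  # P[T_x = x + t], with L[x + 1] = L_x
    c(a_x = sum(100 / (1 + r)^t * p), total_probability = sum(p))

Since the cohort is extinct at the last age, the probabilities sum to one, and the last line returns the discounted pure premium together with that sanity check.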
2.4 Mortality Tables and Life Insurance
To illustrate heterogeneity, let us continue with mortality tables, as many tables are public and openly available. The first modern mortality table was constructed in 1662 by John Graunt in London, and the first scientific mortality table was presented to the Royal Society by Edmond Halley, in 1693, "An Estimate of the Degree of Mortality of Mankind, Drawn from Curious Tables of the Births and Funerals at the City of Breslau." At first, tables were constructed on data obtained from general population statistics, namely the Northampton (also called the "Richard Price" life table) and Carlisle tables; see Milne (1815) or Gompertz (1825, 1833). In order to compute adequate premium rates, insurance companies began to keep accurate and reliable records of their own mortality experience. The first life table constructed on the basis of insurance data was completed in 1834 by actuaries of the Equitable Assurance of London, as discussed in Sutton (1874) and Nathan (1925). Later, American life insurance companies had the benefit of the English experience. As mentioned in Cassedy (2013), English mortality tables tended to overestimate death rates (both in the USA and in England, contributing to the prosperity of life insurance companies), and according to Zelizer (2018), the Presbyterian and Episcopalian funds relied on the Scottish mortality experience (the first Scottish life table was constructed in 1775, as mentioned in Houston (1992)), whereas the Pennsylvania Company and the Massachusetts Hospital Life Insurance Company used the Northampton table. From the 1830s to the 1860s, American companies based their premiums on the Carlisle table. In 1868, Sheppard Homans (actuary of the Mutual Life Insurance Company) and George Phillips (Equitable's actuary) produced the first comprehensive table of American mortality in Homans and Phillips (1868), named the "American Experience" table in Ransom and Sutch (1987).
2.4.1 Gender Heterogeneity
As surprising as it may seem, Pradier (2011) noted that before the end of the eighteenth century, in the UK and in France, the price of life annuities hardly ever depended on the sex of the subscriber. However, the first separate mortality tables for men and women, constructed as early as 1740 by Nicolas Struyck (published in the appendices of a geography article, Struyck (1740)), showed that women generally lived longer than men (Table 2.1).

Table 2.1 Excerpt from the men and women life tables in 1720 (Source: Struyck (1912), page 231), for pseudo-cohorts of one thousand people (L_0 = 1000)

  Men                                   Women
   x   L_x   5q_x     x   L_x   5q_x     x   L_x   5q_x     x   L_x   5q_x
   0  1000  29.0%    45   371  15.6%     0  1000  28.9%    45   423  11.8%
   5   710   5.6%    50   313  19.2%     5   711   5.2%    50   373  14.7%
  10   670   4.2%    55   253  22.9%    10   674   3.3%    55   318  18.2%
  15   642   5.5%    60   195  27.2%    15   652   4.3%    60   260  21.2%
  20   607   6.6%    65   142  31.7%    20   624   5.8%    65   205  26.8%
  25   567   7.9%    70    97  37.1%    25   588   6.8%    70   150  33.3%
  30   522   9.2%    75    61  45.9%    30   548   7.3%    75   100  45.0%
  35   474  10.5%    80    33  51.5%    35   508   7.9%    80    55  56.4%
  40   424  12.5%    85    16           40   468   9.6%    85    24

Struyck (1740) (translated in Struyck (1912)) shows that at age 20, residual life expectancy is 30 years and 3/4 for men, and 35 years and 1/2 for women. It also provides life annuity tables by gender. For a 50-year-old woman, a life annuity was worth 969 florins, compared with 809 florins for a man of the same age. This substantial difference seemed to legitimize a differentiation of premiums. Here, 424 men (L_x) and 468 women (out of one thousand respective births) had reached 40 years of age (x = 40). And among those who had reached 40 years of age, 12.5% of men and 9.6% of women would die within 5 years (mathematically denoted ₅q_x = P[T ≤ x + 5 | T > x]).
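The internal consistency of Table 2.1 can be checked in R, since the 5-year death probabilities can be recomputed from the survivor counts L_x (a sketch; the vectors below are copied from the table):

    age   <- seq(0, 85, by = 5)
    L_men <- c(1000, 710, 670, 642, 607, 567, 522, 474, 424,
               371, 313, 253, 195, 142, 97, 61, 33, 16)
    L_wom <- c(1000, 711, 674, 652, 624, 588, 548, 508, 468,
               423, 373, 318, 260, 205, 150, 100, 55, 24)
    # 5-year death probability at age x: 1 - L_{x+5} / L_x
    q5_men <- 1 - L_men[-1] / L_men[-length(L_men)]
    q5_wom <- 1 - L_wom[-1] / L_wom[-length(L_wom)]
    round(cbind(age = head(age, -1), men = 100 * q5_men, women = 100 * q5_wom), 1)
    # at age 40, for instance: 12.5% for men, 9.6% for women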
According to Pradier (2012), it was not until the Duchy of Calenberg's widows' fund went bankrupt in 1779 that the age and sex of subscribers were used in conjunction to calculate annuity prices. In France, in 1984, the regulatory authorities of the insurance markets decided to use regulatory tables established for the general population by INSEE, based on the population observed over 4 years, namely the PM 73-77 table for men and the PF 73-77 table for women, renamed the TD and TV 73-77 tables respectively (with an analytical extension beyond 99 years). Although the primary factor in mortality is age, gender is also an important factor, as shown in Table 2.2. For more than a century, the mortality rate of men has been higher than that of women in France.
In practice, however, the actuarial pricing of life insurance policies continued to be established without taking into account the gender of the policyholder. In fact, the reason why two tables were used was that the male table was the regulatory table for insurance against death (PM became TD, for "table de décès," or "death table"), whereas the female table became the regulatory table for insurance in case of life, such as annuities (PF became TV, for "table de vie," or "life table"). In 1993, the TD and TV 88-90 tables replaced the two previous tables, with the same principle, i.e., the use of a table built on a male population for coverage against death, and of a table built on a female population for coverage in case of life. From a prudential point of view, the female table models a population that has, on average, a lower mortality rate, and therefore lives longer.
In 2005, the TH and TF 00-02 tables became the regulatory tables, still founded on different populations, namely men and women respectively. This time, the terms for men (H, for hommes) and women (F, for femmes) were maintained, as regulations allowed for the possibility of different pricing for men and women. A ruling by the Court of Justice of the European Union on 1 March 2011, however, made gender-differentiated pricing impossible (as of 21 December 2012), on the grounds that it would be discriminatory. In comparison, recent (French) INED tables are also given in Table 2.2, on the right-hand side.

Table 2.2 Excerpt from French tables, with TD and TV 73-77 on the left-hand side, TD and TV 88-90 in the center, and INED 2017-2019 on the right-hand side

    x   TD 73-77   TV 73-77   TD 88-90   TV 88-90   INED men   INED women
    0     100000     100000     100000     100000     100000       100000
   10      97961      98447      98835      99129      99486        99578
   20      97105      98055      98277      98869      99281        99471
   30      95559      97439      96759      98371      98656        99247
   40      93516      96419      94746      97534      97661        98810
   50      88380      94056      90778      95752      95497        97645
   60      77772      89106      81884      92050      90104        94777
   70      57981      78659      65649      84440      78947        89145
   80      28364      52974      39041      65043      59879        77161
   90       4986      14743       9389      24739      25123        44236
  100        103        531        263       1479       1412         4874
  110          0          0          0          2
2.4.2 Health and Mortality
Beyond gender, all sorts of "discriminating variables" have been studied, in order to build, for example, mortality tables depending on whether the person is a smoker or not, as in Benjamin and Michaelson (1988); see Table 2.3. Indeed, since Hoffman (1931) or Johnston (1945), actuaries had observed that exposure to tobacco, and smoking, had an important impact on the policyholder's health. As Miller and Gerstein (1983) wrote, "it is clear that smoking is an important cause of mortality."

Table 2.3 Residual life expectancy (in years) by age (25-65 years) for smokers and nonsmokers (Source: Benjamin and Michaelson (1988), for 1970-1975 data in the USA)

         Men                       Women
   x   Nonsmoker   Smoker     x   Nonsmoker   Smoker
  25        48.4     42.8    25        52.8     49.8
  35        38.7     33.3    35        43.0     40.1
  45        29.2     24.2    45        33.5     31.0
  55        20.3     16.5    55        24.5     22.6
  65        12.8     10.4    65        16.2     15.1
There are also mortality tables (or calculations of residual life expectancy) by level of body mass index (BMI, introduced by Adolphe Quetelet in the mid-nineteenth century), as calculated by Steensma et al. (2013) in Canada. A "normal" index refers to people with an index between 18.5 and 25 kg/m²; "overweight" refers to an index between 25 and 30 kg/m²; obesity level I refers to an index between 30 and 35 kg/m², and obesity level II refers to an index exceeding 35 kg/m². Table 2.4 shows some of these figures. These orders of magnitude are comparable with Fontaine et al. (2003), among the pioneering studies, Finkelstein et al. (2010), or, more recently, Stenholm et al. (2017). Although Adolphe Quetelet introduced the index, it only became popular in the 1970s, when "Dr. Keys was irritated that life insurance companies [that] were estimating people's body fat—and hence, their risk of dying—by comparing their weights with the average weights of others of the same height, age and gender," as Callahan (2021) explains. In Keys et al. (1972), with "more than 7000 healthy, mostly middle-aged men, Dr. Keys and his colleagues showed that the body mass index was a more accurate—and far simpler—predictor of body fat than the methods used by the insurance industry." Nevertheless, this measure is now known to have many flaws, as explained in Ahima and Lazar (2013).

Table 2.4 Residual life expectancy (in years), as a function of age (between 20 and 70 years) and of BMI level (Source: Steensma et al. (2013))

         Men                                      Women
   x   Normal  Overweight  Obese I  Obese II    Normal  Overweight  Obese I  Obese II
  20     57.2        61.0     59.1      53.5      62.8        66.5     64.6      59.3
  30     47.6        51.4     49.4      44.1      53.0        56.7     54.8      49.5
  40     38.1        41.7     39.9      34.7      43.3        46.9     45.0      39.9
  50     28.9        32.4     30.6      25.8      33.8        37.3     35.5      30.6
  60     20.4        23.6     21.9      17.6      24.9        28.1     26.4      21.9
  70     13.2        15.8     14.4      10.9      16.8        19.7     18.2      14.3
2.4.3 Wealth and Mortality
Higher incomes are associated with longer life expectancy, as mentioned already in Kitagawa and Hauser (1973), probably the first documented analysis. But despite the importance of socioeconomic status to mortality and survival, Yang et al. (2012), Chetty et al. (2016), and Demakakos et al. (2016) stressed that wealth has been under-investigated as a predictor of mortality. Duggan et al. (2008) and Waldron (2013) used social security data in the USA. In France, disparities of life expectancy by social category are well known. Recently, Blanpain (2018) created life tables per wealth quantile. An excerpt can be visualized in Table 2.5, for fictional cohorts, with men on the left-hand side and women on the right-hand side, from the poorest 5% of the population ("0-5%") to the richest 5% ("95-100%"). The force of mortality, as a function of age, gender, and wealth quantile, can be visualized in Fig. 2.2.

Table 2.5 Excerpt of life tables per wealth quantile and gender in France (Source: Blanpain (2018))

         Men                               Women
    x    0-5%    45-50%   95-100%           0-5%    45-50%   95-100%
    0  100000    100000    100000         100000    100000    100000
   10   99299     99566     99619          99385     99608     99623
   20   99024     99396     99469          99227     99506     99526
   30   97930     98878     99094          98814     99302     99340
   40   95595     98058     98627          97893     98960     99074
   50   90031     96172     97757          95021     97959     98472
   60   77943     91050     95649          88786     95543     97192
   70   59824     79805     90399          79037     90408     94146
   80   38548     59103     76115          63224     79117     85825
   90   13337     23526     38837          31190     45750     55918
  100     530      1308      3231           2935      5433      8717

Fig. 2.2 Force of mortality (log scale) for men on the left-hand side and women on the right-hand side, for various income quantiles (bottom, medium, and upper 10%), in France. (Data source: Blanpain (2018))
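From the survivor columns of Table 2.5, a rough order of magnitude of life expectancy at birth can be obtained in R by a trapezoidal approximation over the 10-year steps (a crude sketch, as it ignores mortality patterns within each decade and beyond age 100):

    # men, survivors out of 100,000, every 10 years of age (from Table 2.5)
    poorest <- c(100000, 99299, 99024, 97930, 95595, 90031, 77943,
                 59824, 38548, 13337, 530)
    richest <- c(100000, 99619, 99469, 99094, 98627, 97757, 95649,
                 90399, 76115, 38837, 3231)
    e0 <- function(L) {               # approximate e_0: integral of the survival curve
      S <- L / L[1]
      10 * sum((S[-1] + S[-length(S)]) / 2)
    }
    round(c(poorest_5pct = e0(poorest), richest_5pct = e0(richest)), 1)
    # roughly 72 vs 85 years: a double-digit gap between wealth quantiles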
2.5 Modeling Uncertainty and Capturing Heterogeneity
2.5.1 Groups of Predictive Factors
A multitude of criteria can be used to create rate classes, as we have seen in the context of mortality. To get a good predictive model, as in standard regression models, we simply look for variables that correlate significantly with the variable of interest, as mentioned by Wolthuis (2004). For instance, in the case of car insurance, the following information was proposed in Bailey and Simon (1959): use ("pleasure" or "business"), age (under 25 or not), gender, and marital status (married or not). Specifically, five risk classes are considered, with rate surcharges relative to the first class (which is used here as a reference):
– "pleasure, no male operator under 25" (reference),
– "pleasure, nonprincipal male operator under 25," +65%,
– "business use," +65%,
– "married owner or principal operator under 25," +65%,
– "unmarried owner or principal operator under 25," +140%.
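In this simple "one-way" setting, such surcharges are essentially ratios of average claim frequencies per class; a minimal R sketch on simulated data (the three classes and their frequencies below are invented for the illustration):

    set.seed(1)
    class  <- sample(c("A", "B", "C"), 10000, replace = TRUE)  # simulated classes
    lambda <- c(A = 0.10, B = 0.165, C = 0.24)[class]          # assumed frequencies
    n_claims <- rpois(10000, lambda)
    freq <- tapply(n_claims, class, mean)          # average frequency per class
    round(100 * (freq / freq["A"] - 1))            # surcharge (%) vs class A
    # approximately +65% for B and +140% for C, by construction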
In the 1960s, these rate classes resembled those that would later be produced by classification (or regression) trees, such as those introduced by Breiman et al. (1984). But even with more advanced algorithms, as Davenport (2006) points out, when an actuary creates risk classes and rate groups, in most cases these "groups" are not self-aware, they are not conscious (at most, the actuary will try to describe them by looking at the averages of the different variables). These groups, or risk classes, are built on the basis of available data, and exist primarily as the product of actuarial models. And as Gandy (2016) points out, there is no "physical basis" for group members to identify other members of their group, in the sense that they usually do not share anything, except some common characteristics. As discussed in Sect. 3.2, these risk groups, developed at a particular point in time, create a transient collusion between policyholders, who are likely to change groups as they move, change cars, or even simply grow older.
2.5.2 Probabilistic Models
Consider here a probability space (Ω, F, P), where F is a set of "events" on Ω (A ∈ F is an "event"). Recall briefly that P is a function F → [0, 1] satisfying some properties, such as P(Ω) = 1; an "additivity property" for disjoint events, P(A ∪ B) = P(A) + P(B); a "subset property" (or "inclusion property"), if A ⊂ B, then P(A) ≤ P(B), as in Cardano (1564) or Bernoulli (1713); or, for multiple (possibly infinitely many) disjoint events A₁, ..., Aₙ, ..., as in Kolmogorov (1933),

  P(A₁ ∪ ... ∪ Aₙ ∪ ...) = P(A₁) + ... + P(Aₙ) + ...,

inspired by Lebesgue (1918), etc. In Sect. 3.1.1 we will return to probability measures, as they are extremely important in assessing how well calibrated the model is, as well as how fair it is. But in this section, we need to recall two important properties that are crucial to model heterogeneity.
Proposition 2.1 (Total Probability) If (B_i)_{i∈I} is a partition of Ω (an exhaustive (finite or countable) set of disjoint events),

  P(A) = Σ_{i∈I} P(A ∩ B_i) = Σ_{i∈I} P(A|B_i) · P(B_i),

where, by definition, P(A|B_i) denotes the conditional probability of the occurrence of A, given that B_i occurred.
Proof See Feller (1957) or Ross (1972). □
An immediate consequence is the law of total expectations.
Proposition 2.2 (Total Expectations) For any measurable random variable Y with finite expectation, if (B_i)_{i∈I} is a partition of Ω,

  E(Y) = Σ_{i∈I} E(Y|B_i) · P(B_i).
Proof See Feller (1957) or Ross (1972). □
This formula can be written simply in the case where two sets, two subgroups, are considered, for example, related to the gender of the individual,
  E(Y) = E(Y|woman) · P(woman) + E(Y|man) · P(man).
If Y denotes the life expectancy at birth of an individual, the literal translation of the previous expression is that the life expectancy at birth of a randomly selected individual (on the left) is a weighted average of the life expectancies at birth of females and males, the weights being the respective proportions of females and males in the population. And as E(Y) is an average of the two,
  min{E(Y|woman), E(Y|man)} ≤ E(Y) ≤ max{E(Y|woman), E(Y|man)};
in other words, treating the population as homogeneous, when it is not, means that one group is subsidized by the other, which is called "actuarial unfairness," as discussed by Landes (2015), Frezal and Barry (2019), or Heras et al. (2020). The greater the difference between the two conditional expectations, the greater the unfairness. This "unfairness" is also called "cross-financing," as one group will subsidize the other.
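A small numerical sketch of this cross-financing, in R, with two groups and invented numbers:

    p_group <- c(low = 0.7, high = 0.3)   # proportions of the two groups (assumed)
    m_group <- c(low = 80, high = 240)    # conditional pure premiums E[Y | group]
    EY <- sum(p_group * m_group)          # law of total expectations: E[Y] = 128
    EY - m_group                          # flat premium minus fair group premium
    # low-risk policyholders overpay by 48; high-risk ones underpay by 112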
Definition 2.7 (Pure Premium (Heterogeneous Risks)) Let Y be the non-negative random variable corresponding to the total annual loss associated with a given policy, with covariates x; then the pure premium is μ(x) = E[Y | X = x].
We use the notation μ(x), also named the "regression function" (see Definition 3.1). We also use the notations E_Y[Y] (for E[Y]) and E_{Y|X}[Y|X = x] (for E[Y|X = x]) to emphasize the measure used to compute the expected value (and to avoid confusion). For example, we can write

  E_Y[Y] = ∫_ℝ y f_Y(y) dy and E_{Y|X}[Y|X = x] = ∫_ℝ y f_{Y|X}(y|x) dy = ∫_ℝ y f_{Y,X}(y, x)/f_X(x) dy.
The law of total expectations (Proposition 2.2) can be written, with that notation,

  E_Y[Y] = E_X[ E_{Y|X}[Y|X] ].
An alternative is to write, with synthetic notations, E[Y] = E[E[Y|X]], where the same notation E is used indifferently to describe the same operator on different probability measures.
The law of total expectations can thus be written

  E_Y[Y] = E_X[ E_{Y|X}[Y|X] ] = E_X[ μ(X) ],

which is a desirable property we want to have on any pricing function m (also called "globally unbiased," see Definition 4.26).
Definition 2.8 (Balance Property) A pricing function m satisfies the balance property if E_X[m(X)] = E_Y[Y].
The name "balance property" comes from accounting, as we want assets (what comes in, i.e., premiums m(x)) to equal liabilities (what goes out, i.e., losses y), on average. This concept, as it appears in economics in Borch (1962), corresponds to "actuarial fairness," and is based on a match between the total value of collected premiums and the total amount of legitimate claims made by policyholders. As it is impossible for the insurer to know what future claims will actually be like, it is considered actuarially fair to set the level of premiums on the basis of the historical claims record of people in the same (assumed) risk class. It is on this basis that discrimination is considered "fair" in distributional terms, as explained in Meyers and Van Hoyweghen (2018). Otherwise, the redistribution would be considered "unfair," with forced solidarity from the low-risk group to the high-risk group. This "fairness" was undermined in the 1980s, when private insurers limited access to insurance for people with AIDS, or at risk of developing it, as Daniels (1990) recalls. Feiring (2009) goes further in the context of genetic information: "since the individual has no choice in selecting his genotype or its expression, it is unfair to hold him responsible for the consequences of the genes he inherits—just as it is unfair to hold him responsible for the consequences of any distribution of factors that are the result of a natural lottery." In the late 1970s (see Boonekamp and Donaldson (1979), Kimball (1979) or Maynard (1979)), the idea that proportionality between the premium and the risk incurred would guarantee fairness between policyholders began to be translated into a conditional expectation (conditional on the retained risk factors).
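In practice, some standard pricing models satisfy the balance property by construction on their training data: for instance, a Poisson GLM with log link and an intercept (a classical property of the canonical link). A minimal R sketch, on simulated data:

    set.seed(1)
    n <- 5000
    x <- runif(n)                          # a single rating factor
    y <- rpois(n, exp(-2 + 1.5 * x))       # simulated claim counts
    fit <- glm(y ~ x, family = poisson(link = "log"))
    # balance property: average prediction equals average observation
    c(mean_premium = mean(fitted(fit)), mean_loss = mean(y))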
As discussed in Meyers and Van Hoyweghen (2018), who trace the emergence of actuarial fairness from its conceptual origins in the early 1960s to its position at the heart of insurance thinking in the 1980s, the concept of "actuarial fairness" appeared as more and more countries adopted anti-discrimination legislation. At that time, insurers positioned "actuarial fairness" as a fundamental principle that would be jeopardized if the industry did not benefit from exemptions to such legislation. For instance, according to the Equality Act 2010 in the UK, "it is not a contravention (...) to do anything in connection with insurance business if (a) that thing is done by reference to information that is both relevant to the assessment of the risk to be insured and from a source on which it is reasonable to rely, and (b) it is reasonable to do that thing," as Thomas (2017) wrote.
In most applications, there is strong heterogeneity within the population, with respect to both risk occurrence and risk costs. For example, when modeling mortality, the probability of dying within a given year can be above 50% for very old and sick people, and less than 0.001% for pre-teenagers. Formally, the heterogeneity will be modeled by a latent factor Θ. If y designates the occurrence (or not) of an accident, y is seen as the realization of a random variable Y, which follows a Bernoulli distribution, B(Θ), where Θ is a non-observable latent variable (as in Gourieroux (1999) or Gourieroux and Jasiak (2007)). If y denotes the number of accidents occurring during the year, Y follows a Poisson distribution, P(Θ) (or a negative binomial model, or a parametric model with zero inflation, etc., as in Denuit et al. (2007)). If y denotes the total cost, Y follows a Tweedie distribution, or more generally a compound Poisson distribution, which we denote by L(Θ, ϕ), where L denotes a distribution with mean Θ, and where ϕ is a dispersion parameter (see Definition 3.13 for more details). The goal of the segmentation is to constitute ratemaking classes (denoted B_i previously) in an optimal way, i.e., by ensuring that one class does not subsidize another, from observable characteristics, denoted x = (x₁, x₂, ..., x_k). Crocker and Snow (2013) speak of "categorization based on immutable characteristics." For Gourieroux (1999), it is the "static partition" used to constitute sub-groups of homogeneous risks ("in a given class, the individual risks are independent, with identical distributions"). This is what a classification or regression tree does, the B_i's being the leaves of the tree, with the previous probabilistic notations. If y designates the occurrence of an accident, or the annual (random) load, the actuary tries to approximate E[Y|X], from training data. In an econometric approach, if y designates the occurrence (or not) of an accident, and if x designates the set of observable characteristics of the policyholder, Y|X = x follows a Bernoulli distribution, B(p_x), for example with

  p_x = exp(x⊤β) / (1 + exp(x⊤β)) or p_x = Φ(x⊤β),

for a logistic or probit regression, respectively [5]. If Y designates the number of accidents that occurred during the year, Y|X = x follows a Poisson distribution, P(λ_x), with typically λ_x = exp(x⊤β). If Y denotes the annual cost, Y|X = x follows a Tweedie distribution, or more generally a compound Poisson distribution, L(μ_x, ϕ), where L denotes a distribution of mean μ, with μ_x = E[Y|X = x] (for more details, see Denuit and Charpentier (2004, 2005)).

[5] Φ is here the cumulative distribution function of the standard normal distribution, N(0, 1).
To return to the analysis of De Wit and Van Eeghen (1984), detailed in Denuit and Charpentier (2004), if we assume that the risks are homogeneous, the pure premium will be E[Y], and we have the risk-sharing of Table 2.6. Without purchasing insurance, policyholders face a random loss Y. With insurance, policyholders face a fixed loss E[Y]. The risk is transferred to the insurance company, which faces a random loss Y − E[Y]. On average, the loss for the insurance company is null, and all the risk is carried by the insurer.

Table 2.6 Individual loss, its expected value, and its variance, for the policyholder on the left-hand side and the insurer on the right-hand side. E[Y] is the premium paid, and Y the total loss, from De Wit and Van Eeghen (1984) and Denuit and Charpentier (2004)

                 Loss        Average loss   Variance
  Policyholder   E[Y]        E[Y]           0
  Insurer        Y − E[Y]    0              Var[Y]

At the other extreme, if the latent risk factor Θ were observable, the requested pure premium would be E[Y|Θ], and we would have the split of Table 2.7.

Table 2.7 Individual loss, its expected value, and its variance, for the policyholder on the left-hand side and the insurer on the right-hand side. E[Y|Θ] is the premium paid, and Y the total loss, from De Wit and Van Eeghen (1984) and Denuit and Charpentier (2004)

                 Loss          Average loss   Variance
  Policyholder   E[Y|Θ]        E[Y]           Var[E[Y|Θ]]
  Insurer        Y − E[Y|Θ]    0              Var[Y − E[Y|Θ]]

Proposition 2.3 (Variance Decomposition (1)) For any measurable random variable Y with finite variance,

  Var[Y] = E[Var[Y|Θ]] + Var[E[Y|Θ]],
           (→ insurer)   (→ policyholder)

Proof See Denuit and Charpentier (2004). □

Finally, using only observable features, denoted x = (x₁, x₂, ..., x_k), we would have the decomposition of Table 2.8.

Table 2.8 Individual loss, its expected value, and its variance, for the policyholder on the left-hand side, and the insurer on the right-hand side. E[Y|X] is the premium paid, and Y the total loss, from De Wit and Van Eeghen (1984) and Denuit and Charpentier (2004)

                 Loss          Average loss   Variance
  Policyholder   E[Y|X]        E[Y]           Var[E[Y|X]]
  Insurer        Y − E[Y|X]    0              E[Var[Y|X]]

Proposition 2.4 (Variance Decomposition (2)) For any measurable random variable Y with finite variance,

  Var[Y] = E[Var[Y|X]] + Var[E[Y|X]],
           (→ insurer)   (→ policyholder)

where

  E[Var[Y|X]] = E[E[Var[Y|Θ]|X]] + E[Var[E[Y|Θ]|X]]
              = E[Var[Y|Θ]] + E[Var[E[Y|Θ]|X]].
                (perfect ratemaking)   (misclassification)

Proof See Denuit and Charpentier (2004). □
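These decompositions are easy to check by simulation; here is a sketch in R, with a Gamma-distributed latent factor Θ and Poisson claim counts given Θ (a classical mixed Poisson model, with invented parameters):

    set.seed(1)
    n <- 1e6
    theta <- rgamma(n, shape = 2, rate = 10)   # latent risk factor, mean 0.2
    y <- rpois(n, theta)                       # Y | Θ ~ Poisson(Θ)
    # Var[Y] = E[Var[Y|Θ]] + Var[E[Y|Θ]] = E[Θ] + Var[Θ] for the mixed Poisson
    c(var_Y = var(y), decomposition = mean(theta) + var(theta))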
This "misclassification" term (on the right) is called "subsidiërende solidariteit," or "subsidizing solidarity," in De Pril and Dhaene (1996), as opposed to the "kanssolidariteit," or "random solidarity," term (on the left). Unless certainty replaces uncertainty, it will never disappear, at least as long as it is a matter of predicting, at the time of subscription, what may happen during the year. For Corlier (1998), segmentation "decreases the solidarity of risks belonging to different segments." And Löffler et al. (2016), citing a McKinsey report on the future of the insurance industry, mentioned that massive data will inevitably lead to de-mutualization and an increased focus on prediction. Nevertheless, Lemaire et al. (2016) suggested some practical limits, as "this process of segmentation, the sub-division of a portfolio of drivers into a large number of homogeneous rating cells, only ends when the cost of including more risk factors exceeds the profit that the additional classification would create, or when regulators rule out new variables."
The link with solidarity is discussed in Gollier (2002), who reminds us that solidarity is fundamentally about making transfers in favor of disadvantaged people, compared with advantaged people, as discussed in the introduction to this chapter. But a very limited version of solidarity is taken into account in the context of insurance: "solidarity in insurance means deciding not to segment the corresponding risk market on the basis of the observable characteristics of individuals' risks," as in health insurance or unemployment insurance. It should be noted that although, historically, the x variables were discretized in order to make "tariff classes," it is now conventional to consider continuous variables as such, or even to transform them, while maintaining a relative regularity. According to De Wit and Van Eeghen (1984), in the past, it used to be very difficult to discover risk factors, both in a qualitative and in a quantitative sense: "solidarity was therefore, unavoidably, considerable. But recent developments have changed this situation: with the help of computers it has become possible to make thorough risk analyses, and consequently to arrive at further premium differentiation."
Again, the difficulty with pricing is that this underlying risk factor Θ is not observable. Not capturing it would lead to unfairness, as it would unduly subsidize the "riskier" individuals (those likely to have more expensive claims) at the expense of the "less risky" ones. Baker and Simon (2002) went further, arguing that the reason why some people are classified as "low risk" and others as "high risk" is irrelevant. Speaking of automating accountability, Baker and Simon (2002) argued that it was important to make people accountable for the risk that they bring to the mutuality, especially the riskiest policyholders, in order for the least risky policyholders to "feel morally comfortable" (as Stone (1993) put it). The danger is that, in this way, the allocation of each person's contributions to the mutuality would be the result of an actuarial calculation, as Stone (1993) put it. Porter (2020) said that this process was "a way of making decisions without seeming to decide." We review this point when we discuss exclusions and the interpretability of models. The insurer then uses proxies to capture this heterogeneity, as we have just seen. A proxy (one might call it a "proxy variable") is a variable that is not significant in its own right, but which replaces a useful but unobservable, or unmeasurable, variable, according to Upton and Cook (2014).
Most of our discussion focuses on tariff discrimination, and more precisely on the "technical" tariff. As mentioned in the introduction, from the point of view of the policyholder, this is not the most relevant variable. Indeed, in addition to the actuarial premium (the pure premium mentioned earlier), there is a commercial component, as an insurance agent may decide to offer a discount to one policyholder or another, taking into account a different risk aversion or a greater or lesser price elasticity (see Meilijson 2006). But an important underlying question is: "is the provided service the same?" Ingold and Soper (2016) review the example of Amazon not offering the same services to all its customers, in particular same-day-delivery offers, available in certain neighborhoods chosen by an algorithm that ultimately reinforced racial bias (by never offering same-day delivery in neighborhoods composed mainly of minority groups). A naive reading of prices on Amazon would then be misleading because of this important bias in the data, which should be taken into account. As Calders and Žliobaite (2013) remind us, "unbiased computational processes can lead to discriminative decision procedures." In insurance, one could imagine that a claims manager does not offer the same compensation to people with different profiles, some people being less likely to dispute than others. It is important to better understand the relationship between these different concepts.
2.5.3 Interpreting and Explaining Models
A large part of the actuary’s job is to motivate, and explain, a segmentation. Some authors, such as Pasquale (2015), Castelvecchi (2016), or Kitchin (2017), have pointed out that machine-learning algorithms are characterized by their opacity and their “incomprehensibility,” sometimes called “black box” (or opaque) properties. And it is essential to explain them, to tell a story. For Rubinstein (2012), as mentioned earlier, models are “fables”: “economic theory spins tales and calls them models. An economic model is also somewhere between fantasy and reality (...) the word model sounds more scientific than the word fable or tale, but I think we are talking about the same thing.” In the same way, the actuary will have to tell the story of his or her model, before convincing the underwriting and insurance agents to adopt it. But this narrative is necessarily imprecise. As Saint Augustine said, “What is time? If no one asks me, I know. But if someone asks me and I want to explain it, then I don’t know anymore.”
Fig. 2.3 The evolution of auto insurance claim frequency as a function of primary driver age, relative to overall annual frequency, with a Poisson regression in yellow, a smoothed regression in red, a smoothed regression with a too small smoothing bandwidth in blue, and with a regression tree in green. Dots on the left are the predictions for a 22-year-old driver. (Data source: CASdataset R package, see Charpentier (2014))
One can hear that age must be involved in the prediction of claim frequency in car insurance, and indeed, as we see in Fig. 2.3, the prediction will not be the same at 18, 25, or 55 years of age. Quite naturally, a premium surcharge for young drivers can be legitimized, because of their limited driving experience, coupled with unlearned reflexes. But this story does not tell us at what order of magnitude this surcharge would seem legitimate. Going further, the choice of model is far from neutral on the prediction: for a 22-year-old policyholder, relatively simple models propose an extra premium of +27%, +73%, +82%, or +110% (compared with the average premium for the entire population). Although age discrimination may seem logical, how much difference can be allowed here, and what difference would be perceived as "quantitatively legitimate"? In Sect. 4.1, we present standard approaches used to interpret actuarial predictive models, and to explain predicted outcomes.
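To see how much the choice of model matters, here is a small R sketch on simulated data (with an invented age effect, so the percentages do not match those of Fig. 2.3): two simple Poisson models, one log-linear in age and one with age bands, already give noticeably different surcharges for a 22-year-old driver:

    set.seed(1)
    n <- 20000
    age <- sample(18:80, n, replace = TRUE)
    y <- rpois(n, 0.05 + 0.25 * exp(-(age - 18) / 8))   # assumed true frequency
    fit1 <- glm(y ~ age, family = poisson)                              # log-linear
    fit2 <- glm(y ~ cut(age, c(17, 25, 40, 60, 81)), family = poisson)  # age bands
    new <- data.frame(age = 22)
    pred <- c(linear = as.numeric(predict(fit1, new, type = "response")),
              bands  = as.numeric(predict(fit2, new, type = "response")))
    round(100 * (pred / mean(y) - 1))   # surcharge (%) vs the portfolio average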
2.6 From Technical to Commercial Premiums
So far, we have discussed heterogeneity in technical pure premiums, but in real-life applications, some additional heterogeneity can yield additional sources of discrimination.
2.6.1 Homogeneous Policyholders
Technical or actuarial premiums are purely based on risk characteristics, whereas commercial premiums are based on economic considerations. In classical textbooks on the economics of insurance (see Dionne and Harrington (1992), Dionne (2000, 2013), or Eisen and Eckles (2011)), homogeneous agents are considered perfectly informed (not in the sense that there is no randomness, but in the sense that they perfectly know the odds of unfortunate events). They have a utility function u (perfectly known also), a wealth w, and they agree to transfer the risk (or parts of the risk) against the payment of a premium π if it satisfies

  u(w − π) ≥ E[u(w − Y)].
The utility that they get when paying the premium (on the left-hand side) exceeds the expected utility that they get when keeping the risk (on the right-hand side). Thus, an insurer, also with perfect knowledge of the wealth and utility of the agent (or of his or her risk aversion), could ask for the following premium, named the "indifference premium."
Definition 2.9 (Indifference Utility Principle) Let Y be the non-negative random variable corresponding to the total annual loss associated with a given policy; for a policyholder with utility u and wealth w, the indifference premium is

  π = w − u⁻¹( E[u(w − Y)] ).
If u is the identity function, π = E[Y], corresponding to the technical, actuarial, pure premium. And if the agent is risk averse, u is concave and π ≥ E[Y].
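A numerical sketch in R, with invented numbers: wealth w = 100, a total loss with probability 0.4, and the concave utility u(x) = √x (so that u⁻¹(z) = z²):

    w <- 100                       # initial wealth
    p <- 0.4                       # probability of losing everything
    u     <- function(x) sqrt(x)   # concave utility (risk-averse agent)
    u_inv <- function(z) z^2
    Eu <- p * u(0) + (1 - p) * u(w)          # expected utility, keeping the risk
    pi_indiff <- w - u_inv(Eu)               # indifference premium: 64
    pi_pure   <- p * w                       # pure premium E[Y]: 40
    c(pure = pi_pure, indifference = pi_indiff, loading = pi_indiff - pi_pure)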
Consider a simple scenario where a policyholder with a wealth of w, and with a concave utility function u, faces a potential random loss occurring with a probability p, resulting in a total loss of wealth. In Fig. 2.4, we can visualize both π, the indifference premium (from Definition 2.9), and E(Y), the pure premium (from Definition 2.4). In Fig. 2.4, we use p = 2/5 just to get a visual perspective. Here, the random loss Y takes two values,

  Y = y₂ = w with probability p = 2/5,
      y₁ = 0 with probability 1 − p = 3/5,

or equivalently, the wealth is

  w − Y = w − y₂ = 0 with probability p = 2/5,
          w − y₁ = w with probability 1 − p = 3/5.

The technical pure premium is here π₀ = E(Y) = p·y₂ + (1 − p)·y₁ = pw, and when paying that premium, the wealth would be w − π₀ = (1 − p)w.
Fig. 2.4 Utility and (ex-post) wealth, with an increasing concave utility function u, whereas the straight line corresponds to a linear utility u₀ (risk neutral). Starting with initial wealth ω, the agent will have random wealth W after 1 year, with two possible states: either w₁ (complete loss, on the left part of the x-axis) or w₂ = ω (no loss, on the right part of the x-axis). Complete loss occurs with 40% chance (2/5). π₀ is the pure premium (corresponding to a linear utility) whereas π is the commercial premium. All premiums in the colored area are high enough for the insurance company, and low enough for the policyholder
|
||
|
||
If the agent is (strictly) risk averse, u(w − π₀) > E[u(w − Y)], in the sense that the insurance company can ask for a higher premium than the pure premium:

π₀ = E[Y] > 0 : actuarial (pure) premium
π − π₀ = w − E[Y] − u⁻¹(E[u(w − Y)]) ≥ 0 : commercial loading.

We come back to the practice of price optimization in Sect. 2.6.3.
2.6.2 Heterogeneous Policyholders

"Actuarial fairness" refers to the notion of "legitimate" discrimination when it is based on a risk factor. According to Thomas (2012), certain laws forbid discrimination based on age but often include provisions allowing exceptions for insurance underwriting. However, these exceptions typically apply only to differences that can be justified by variances in the underlying risk and the technical premium. In the previous section, we discussed economic models, based on individual wealth w and utility functions u, which can actually be heterogeneous. As mentioned in Boonen and Liu (2022), with information about personal characteristics, the insurer can customize insurance coverage and premium for each individual in order to optimize his or her objective function. Commercial insurance premiums then depend on observable variables that correlate with the individual's risk-aversion parameter, such as gender, age, or even race (as considered in Pope and Sydnor (2011)). Such discrimination may violate insurance regulations with respect to discrimination.

Fig. 2.5 On the left, the same graph as in Fig. 2.4, with utility on the y-axis and (ex-post) wealth on the x-axis, with an increasing concave utility function u, and an ex-post random wealth W after 1 year, with two possible states: either w₁ (complete loss) or w₂ = ω (no loss). Complete loss occurs with 40% chance (2/5) on the left-hand side, whereas complete loss occurs with 60% chance (3/5) on the right-hand side. Agents have identical risk aversion and wealth, on both graphs. The indifference premium is larger when the risk is more likely (with the additional black part on top of the technical pure premium; here, the commercial loading is almost the same)
In the context of heterogeneity of the underlying risk only, consider the case in which heterogeneity is captured through covariates x and where agents have the same wealth w and the same utility u,

π₀(x) = E[Y | X = x] : actuarial premium
π − π₀ = w − E[Y | X = x] − u⁻¹(E[u(w − Y) | X = x]) : commercial loading.

For example, in Fig. 2.5, we have on the left, the same example as in Fig. 2.4, corresponding to some "good" risk,

Y = y₂ = w with probability p = 2/5, or y₁ = 0 with probability 1 − p = 3/5.

On the right, we have some "bad" risk, where the value of the loss is unchanged, but the probability of claiming a loss is higher (p′ > p). In Fig. 2.5,

Y = y₂ = w with probability p′ = 3/5 > 2/5, or y₁ = 0 with probability 1 − p′ = 2/5.

In that case, it could be seen as legitimate, and fair, to ask a higher technical premium, and possibly to add the appropriate loading then.
If heterogeneity is no longer in the underlying risk, but in the risk aversion (or possibly the wealth), so that u is now a function uₓ of some covariates, we should write

π₀(x) = μ(x) = E[Y] > 0 : actuarial premium
π − π₀ = w − E[Y] − uₓ⁻¹(E[uₓ(w − Y)]) ≥ 0 : commercial loading.

Here, we used the expected utility approach from Von Neumann and Morgenstern (1953) to illustrate, but alternatives could be considered.
The approach described previously is also named "differential pricing," where customers with a similar risk are charged different premiums (for reasons other than risk occurrence and magnitude). Along these lines, Central Bank of Ireland (2021) considered "price walking" as discriminatory. "Price walking" corresponds to the case where longstanding, loyal policyholders are charged higher prices for the same services than customers who have just switched to that provider. This is a well-documented practice in the telecommunications industry that can also be observed in insurance (see Guelman et al. (2012) or He et al. (2020), who model attrition rate, or "customer churn"). According to Central Bank of Ireland (2021), the practice of price walking is "unfair" and could result in unfair outcomes for some groups of consumers, both in the private motor insurance and household insurance markets. For example, long-term customers (those who stayed with the same insurer for 9 years or more) pay, on average, 14% more for private car insurance and 32% more for home insurance than the equivalent customer renewing for the first time.
2.6.3 Price Optimization and Discrimination
"We define price optimization in P&C insurance [property and casualty insurance, or nonlife insurance⁶] as the supplementation of traditional supply-side actuarial models with quantitative customer demand models," explained Bugbee et al. (2014). Duncan and McPhail (2013), Guven and McPhail (2013), and Spedicato et al. (2018) mention that such practices are intensively discussed by practitioners, even if they have not received much attention in academic journals. Notable exceptions would be Morel et al. (2003), who introduced myopic pricing, whereas more realistic approaches, named "semi-myopic pricing strategies," were discussed in Krikler et al. (2004) or, more recently, Ban and Keskin (2021).
⁶ Formally, property covers a home (physical building) and the belongings in it from losses such as fire, theft, etc., or covers damage to a car when involved in an accident, including protection from damage or loss caused by other factors such as fire, vandalism, etc. Casualty involves coverage if one is held responsible for someone injuring themselves on one's property, or if one were to cause any damage to someone else's property, and coverage if one gets into an accident and causes injuries to someone else or damage to their car.
Many regulators believe that price optimization is unfairly discriminatory (as shown in Box 2.3, with regulations in some states of the USA). Is it legitimate discrimination to have premiums depend on willingness or ability to pay, and on risk aversion? According to the Code of Professional Conduct⁷ (Precept 1, on "Professional Integrity"), "an actuary shall act honestly (...) to fulfill the profession's responsibility to the public and to uphold the reputation of the actuarial profession."
Box 2.3 Price Optimization in the USA
• Alaska, Wing-Heir (2015)

The practice of adjusting either the otherwise applicable manual rates or premiums or the actuarially indicated rates or premiums based on any of the following is considered inconsistent with the statutory requirement that "rates shall not be (...) unfairly discriminatory," whether or not such adjustment is included within the insurer's rating plan:

(a) Price elasticity of demand;
(b) Propensity to shop for insurance;
(c) Retention adjustment at an individual level; and
(d) A policyholder's propensity to ask questions or file complaints.
• California, Volkmer (2015)

Price Optimization does not seek to arrive at an actuarially sound estimate of the risk of loss and other future costs of a risk transfer. Therefore, any use of Price Optimization in the ratemaking/pricing process or in a rating plan is unfairly discriminatory in violation of Californian law.

• District of Columbia, Taylor (2015)

Price optimization refers to an insurer's practice of charging the maximum premium that it expects an individual or class of individuals to bear, based upon factors that are neither risk-of-loss related nor estimated-expense related. For example, an insurer may charge a non-price-sensitive individual a higher premium than it would charge a price-sensitive individual, despite their risk characteristics being equal. This practice is discriminatory and it violates the District's anti-discrimination insurance laws codified at D.C. Official Code §31-2231.13(c), 31-2703(a) and 31-2703(b).
⁷ See https://www.soa.org/about/governance/about-code-of-professional-conduct/.
• Pennsylvania, Miller (2015b)

With the advent of sophisticated pricing tools, including computer software and rating models referred to as price optimization, insurers, rating organizations, and advisory organizations are reminded that policyholders and applicants with identical risk classification profiles (that is, risks of the same class and essentially the same hazard) must be charged the same premium. Rates that fail to reflect differences in expected losses and expenses with reasonable accuracy are unfairly discriminatory under Commonwealth law and will not be approved by the Department.
2.7 Other Models in Insurance
So far, we have discussed only premium principles, but predictive models are used almost everywhere in insurance.
2.7.1 Claims Reserving and IBNR
Another interesting application is reserving (see Hesselager and Verrall (2006) or Wüthrich and Merz (2008) for more details). Loss reserves are a major item in the financial statement of an insurance company and in terms of how it is valued from the perspective of possible investors. The development and the release of reserves are furthermore important input variables to calculate the MCEV (market consistent embedded value), which "provides a means of measuring the value of such business at any point in time and of assessing the financial performance of the business over time," as explained in American Academy of Actuaries (2011). Hence, the estimates of unpaid losses give management important input for their strategy, pricing, and underwriting. A reliable estimate of the expected losses is therefore crucial. Traditional models for reserving for future claims are mainly based on claims triangles, either with distribution-free methods (e.g., Chain Ladder or Bornhuetter-Ferguson, as described in De Alba (2004)) or with distribution-based (stochastic) models. Both work with data aggregated at the level of the gross insurance portfolio, or at the level of a subportfolio, as the methodology requires the use of portfolio-based parameters, e.g., reported or paid losses, or prior expected parameters such as losses or premiums. The reserving amount can be influenced by many factors, for example, the composition of the claim, medical advancement, life expectancy, legal changes, etc. The consequence is a loss of potentially valuable information at the level of the single contract, as the determining drivers are entirely disregarded.
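As a minimal illustration of the claims-triangle mechanics, here is a Chain Ladder sketch in Python, on an invented cumulative triangle (all figures are made up for illustration; this is a sketch of the technique, not a production reserving model):

import numpy as np

# Toy cumulative claims triangle: rows = accident years, columns = development years.
# np.nan marks future (unobserved) cells; all numbers are illustrative only.
triangle = np.array([
    [1000., 1500., 1700., 1800.],
    [1100., 1650., 1870., np.nan],
    [1200., 1800., np.nan, np.nan],
    [1300., np.nan, np.nan, np.nan],
])

m = triangle.shape[1]
# Volume-weighted development factors f_j = sum(C_{i,j+1}) / sum(C_{i,j}),
# computed on the rows where both columns are observed.
factors = []
for j in range(m - 1):
    obs = ~np.isnan(triangle[:, j + 1])
    factors.append(triangle[obs, j + 1].sum() / triangle[obs, j].sum())

# Complete the lower triangle by successive multiplication.
full = triangle.copy()
for j in range(m - 1):
    missing = np.isnan(full[:, j + 1])
    full[missing, j + 1] = full[missing, j] * factors[j]

ultimates = full[:, -1]
latest = np.array([row[~np.isnan(row)][-1] for row in triangle])
reserves = ultimates - latest                  # estimated unpaid losses per accident year
print(np.round(factors, 3), np.round(reserves, 1))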
2.7.2 Fraud Detection
Fraud is not self-revealed, and therefore, it must be investigated, as noted by Guillen (2006) and Guillen and Ayuso (2008). Tools for detecting fraud span all kinds of actions undertaken by insurers. They may involve human resources, data mining, external advisors, statistical analyses, and monitoring. The methods currently available for detecting fraudulent or suspicious claims based on human resources rely on video or audiotape surveillance, manual indicator cards, internal audits, and information collected from agents or informants. Methods based on data analysis seek external and internal data information. Automated methods use various machine-learning techniques, such as fuzzy set clustering in Derrig and Ostaszewski (1995), simple regression models in Derrig and Weisberg (1998), GLMs, with a logistic regression in Artís et al. (1999) and Artís et al. (2002) and a probit model in Belhadji et al. (2000), or neural networks, in Brockett et al. (1998) or Viaene et al. (2005).
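As a small sketch of the GLM-based approach mentioned above, the following Python code fits a logistic regression to synthetic claim data (the two features and all figures are invented for illustration) and produces a fraud score used to rank claims for investigation:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))                     # e.g., scaled claim amount and report delay
logit = -3 + 1.2 * X[:, 0] + 0.8 * X[:, 1]      # invented "true" fraud mechanism
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))   # rare binary fraud labels

clf = LogisticRegression().fit(X, y)
scores = clf.predict_proba(X)[:, 1]             # fraud scores, to prioritize audits
print(clf.coef_, scores[:5].round(3))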
2.7.3 Mortality
The modeling and forecasting of mortality rates have been subject to extensive research in the past. The most widely used approach is the “Lee-Carter Model,” from Lee and Carter (1992) with its numerous extensions. More recent approaches involve nonlinear regression and GLMs. But recently, many machine-learning algorithms have been used to detect (unknown) patterns, such as Levantesi and Pizzorusso (2019), with decision trees, random forests, and gradient boosting. Perla et al. (2021) generalized the Lee-Carter Model with a simple convolutional network.
2.7.4 Parametric Insurance
Parametric insurance is also an area where predictive models are important. Here, we consider guaranteed payment of a predetermined amount of an insurance claim upon the occurrence of a stipulated triggering event, which must be some predefined parameter or metric specifically related to the insured person’s particular risk exposure, as explained in Hillier (2022) or Jerry (2023).
2.7.5 Data and Models to Understand the Risks
So far, we have mentioned the use of data and models in the context of estimating a "fair premium." But it should also be highlighted that insurance companies have helped to improve the quality of life in many countries, using data that they collected. A classic example is the development of epidemiology. Indeed, as early as the nineteenth century, insurance doctors initiated an approach that prefigures systematic medical examinations, developing our contemporary medicine, based on prevention, and increasingly oriented toward patients who do not look after themselves. As early as 1905, John Welton Fischer, medical director of the Northwestern Mutual Life Insurance Company and a member of the Association of Life Insurance Medical Directors of America, became interested in the routine measurement of blood pressure in the examination of life insurance applicants. He was the first to do so, at a time when the newly invented blood pressure monitor had not really proved its worth and was still confined to experimental use. At the beginning of 1907, Fischer began to measure the systolic blood pressure of applicants aged between 40 and 60 years. He then instructed his company's physicians to perform this measurement in cities with more than 100,000 inhabitants. By 1913, 85% of his company's applicants had had their blood pressure measured. Although Fischer's conclusions are clear, he does not explain how he foresaw the importance of this measurement as a risk factor. When Fischer proposed the introduction of blood pressure measurement for the newly insured person, there was no information on the prognosis associated with elevated blood pressure, nor was there a clear definition of what "normal" pressure should be, as Kotchen (2011) recalls. The relationship between blood pressure and cardiovascular morbidity was still completely unknown, despite some work by clinicians. Insurance companies produced the first prospective statistics for hypertension, a term that did not then refer to any well-defined disease or concept. In 1911, Fischer wrote a letter to the Medical Directors Association explaining to his peers that "the sphygmomanometer is indispensable in life insurance examinations, and the time is not far distant when all progressive life insurance companies will require its use in all examinations of applicants for life insurance."
In 1915, the Prudential Life Insurance Company had already measured the blood pressure of 18,637 applicants, and the New York Life Insurance Company that of 62,000 applicants for insurance; by 1922, the New York experiment of the Metropolitan Life Insurance Company totaled 500,000 examinations in more than 8000 insured persons, recalls Dingman (1927). No private practitioner, no hospital doctor, no organization, until then, had been able to compile such statistics. In a series of reports that began with Dublin (1925), the Actuarial Society of America described the distribution of blood pressure across the population, the age-related increases in blood pressure, and the relationships of blood pressure to both body size and mortality. This report studied a cohort of 20,000 insured persons, aged 38 to 42 years, with measurements of systolic and diastolic blood pressure. The report showed an increase in systolic and diastolic blood pressures with age. At younger ages, systolic and diastolic blood pressures were lower in women than in men. Blood pressure also increased progressively with age in both men and women. The report also showed that systolic and diastolic blood pressure increased with height in men, defined in terms of "build groups" (average weight for each inch of height) in different age groups of men. It eventually noted that changes in diastolic blood pressure were more important than changes in systolic blood pressure in predicting mortality. For insurers, this information, although measured on an ad hoc basis, was sufficient to exclude certain individuals or to increase their insurance premiums. The designation of hypertension as a risk factor for reduced life expectancy was not based on research into the risk factors for hypertension, but on a simple measure of correlation and risk analysis. And the existence of a correlation did not necessarily indicate a causal link, but this was not the concern of the many physicians working for insurers. Medical research was then able to work on a better understanding of these phenomena, observed by the insurers, who had access to these data (because they had the good idea of collecting them).
Chapter 3
Models: Overview on Predictive Models
Abstract In this chapter, we give an overview of predictive modeling as used by actuaries. Historically, we moved from relatively homogeneous portfolios to tariff classes, and then to modern insurance, with the concept of "premium personalization." Modern modeling techniques are presented, starting with econometric approaches, before presenting machine-learning techniques.
As we have seen in the previous chapter, insurance is deeply related to predictive modeling. But contrary to popular opinion that models and algorithms are purely objective, O'Neil (2016) explains in her book that "models are opinions embedded in mathematics (...). A model's blind spots reflect the judgments and priorities of its creators." In this chapter (and the next one), we get back to general ideas about actuarial modeling.
3.1 Predictive Model, Algorithms, and “Artificial Intelligence”
3.1.1 Probabilities and Predictions
We will not start a philosophical discussion about risk and uncertainty here. However, in actuarial science, all stories begin with a probabilistic model. "Probability is the most important concept in modern science, especially as nobody has the slightest notion what it means," said Bertrand Russell in a conference, back in the early 1930s, quoted in Bell (1945). Very often, the "physical" probabilities receive an objective value, on the basis of the law of large numbers, as the empirical frequency converges toward "the probability" (the frequentist theory of probabilities).
Proposition 3.1 (Law of Large Numbers (1)) Consider an infinite collection of i.i.d. random variables X, X₁, X₂, · · · , Xₙ, · · · in a probability space (Ω, F, P); then

(1/n) Σ_{i=1}^{n} 1(X_i ∈ A) → P({X ∈ A}) = P[A] almost surely, as n → ∞,

where the left-hand side is the (empirical) frequency and the right-hand side is the probability.

Proof Strong law of large numbers (also called Kolmogorov's law); see Loève (1977), or any probability textbook.
This is a so-called "physical," or "objective," probability. It means that if we throw a die a lot of times (here n), the "probability" of obtaining a 6 with this die is the empirical frequency of 6 we obtained. Of course, with a perfectly balanced die, there is no need to repeat throws of the die to affirm that the probability of obtaining a 6 on a throw is equal to 1/6 (by the symmetry of the cube). But if we repeat the experience of throwing a die millions of times, 1/6 should be close to the frequency of appearance of 6, corresponding to the "frequentist" definition of the concept of "probability." Almost 200 years ago, Cournot (1843) already distinguished an "objective meaning" of the probability (as a measure of the physical possibility of realization of a random event) and a "subjective meaning" (the probability being a judgment made on an event, this judgment being linked to the ignorance of the conditions of the realization of the event).
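A short simulation makes the frequentist reading of Proposition 3.1 tangible: the empirical frequency of a 6 approaches 1/6 ≈ 0.1667 as the number of throws grows (a minimal Python sketch, assuming a fair die):

import numpy as np

rng = np.random.default_rng(42)
for n in (100, 10_000, 1_000_000):
    rolls = rng.integers(1, 7, size=n)   # n throws of a fair six-sided die
    print(n, (rolls == 6).mean())        # empirical frequency of a 6, converging to 1/6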
If we use that "frequentist" definition (also coined "long-run probability" in Kaplan (2023), as Proposition 3.1 is an asymptotic result), we are unable to make sense of the probability of a single, singular event, as noted by von Mises (1928, 1939): "When we speak of the 'probability of death', the exact meaning of this expression can be defined in the following way only. We must not think of an individual, but of a certain class as a whole, e.g., 'all insured men forty-one years old living in a given country and not engaged in certain dangerous occupations'. A probability of death is attached to the class of men or to another class that can be defined in a similar way. We can say nothing about the probability of death of an individual even if we know his condition of life and health in detail. The phrase 'probability of death', when it refers to a single person, has no meaning for us at all." And there are even deeper paradoxes, which can be related to latent risk factors discussed in the previous chapter, and the "true underlying probability" (of claiming a loss, or of dying). In a legal context, Fenton and Neil (2018) quoted a judge, who was told that a person was less than 50% guilty: "look, the guy either did it or he did not do it. If he did then he is 100% guilty and if he did not then he is 0% guilty; so giving the chances of guilt as a probability somewhere in between makes no sense and has no place in the law." The main difference with actuarial pricing is that we should estimate probabilities associated with future events. But still, one can wonder if "the true probability" is a concept that makes sense when signing a contract. Thus, the goal here will be to train a model that will compute a score that might be interpreted as a "probability" (this will raise the question of "calibration" of a model, i.e., the connection between that score and the "observed frequencies," interpreted as probabilities, as discussed in Sect. 4.3.3).
Given a probability measure P, one can define "conditional probabilities," the standard notation being the vertical bar: P[A|B] is the conditional probability that event A occurs given the information that event B occurred. It is the ratio of the probability that both A and B occurred (corresponding to P[A ∩ B]) over the probability that B occurred. Based on that definition, we can derive Bayes' formula.
Proposition 3.2 (Bayes Formula) Given two events A and B such that P[B] ≠ 0,

P[A|B] = (P[B|A] · P[A]) / P[B] ∝ P[B|A] · P[A].

Proof Bayes (1763) and Laplace (1774).
Besides the mathematical expression, that formula has two possible interpretations. The first one corresponds to an "update of beliefs," from a prior distribution P[A] to a posterior distribution P[A|B], given some additional information B. The second one is related to an "inverse problem," where we try to determine the causes of a phenomenon from the experimental observation of its effects. An example could be the one where A is a disease and B is a symptom (or a set of symptoms); with Bayes' rule (see Spiegelhalter et al. (1993) for more details, with multiple diseases and multiple symptoms),

P[disease|symptom] ∝ P[symptom|disease] · P[disease].

Another close example would be one where B is the result of a test, and

P[cancer|test positive] ∝ P[test positive|cancer] · P[cancer].
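To see why the normalizing term P[B] matters, here is a small numerical sketch of the test example in Python (the prevalence, sensitivity, and false-positive rate are invented, illustrative figures):

# Illustrative figures (not from the book): prevalence 1%, sensitivity 90%,
# false-positive rate 8%.
p_cancer = 0.01
p_pos_given_cancer = 0.90
p_pos_given_healthy = 0.08

# Total probability of a positive test, then Bayes' formula
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_healthy * (1 - p_cancer)
posterior = p_pos_given_cancer * p_cancer / p_pos
print(round(posterior, 3))  # about 0.102: a positive test lifts 1% to roughly 10%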
A convention in actuarial literature is to suppose that random variables live in a probability space (Ω, F, P), without much discussion about the probability measure P. In most sections of this book, P is the (unobservable) probability measure associated with the portfolio of the insurer (or associated with the training dataset). And we will use Pₙ, associated with sample Dₙ, as explained in Sect. 1.2.1. For instance, using the validation dataset from GermanCredit, let X denote the age of the person with a loan, and Y the score of a random person (obtained from a logistic regression), and let the gender be the sensitive attribute S ∈ {A, B}, so that

Pₙ[X ∈ [18; 25] | S = A] = 32% and Pₙ[Y > 50% | S = A] = 25%
Pₙ[X ∈ [18; 25] | S = B] = 10% and Pₙ[Y > 50% | S = B] = 21%.
But because there is competition in the market, Pₙ can be different from P, the probability measure for the entire population,

P[X ∈ [18; 25] | S = A] = 20% and P[Y > 50% | S = A] = 20%
P[X ∈ [18; 25] | S = B] = 15% and P[Y > 50% | S = B] = 15%.
There could also be some target probability measure P⋆, as underwriters can be willing to target some specific segments of the population, as discussed in Chaps. 7 and 12 (on change of measures),

P⋆[X ∈ [18; 25] | S = A] = 25% and P⋆[Y > 50% | S = A] = 25%
P⋆[X ∈ [18; 25] | S = B] = 20% and P⋆[Y > 50% | S = B] = 20%,
or possibly some "fair" measure Q, as we discuss in Chaps. 8 (on quantifying group fairness) and 12 (when mitigating discrimination), that will satisfy some independence properties,

Q[X ∈ [18; 25] | S = A] = 39% and Q[Y > 50% | S = A] = 23%
Q[X ∈ [18; 25] | S = B] = 15% and Q[Y > 50% | S = B] = 23%.
It is also possible to mention here the fact that the model is fitted on past data, associated with the probability measure Pₙ, but because of the competition in the market, or because of the general economic context, the structure of the portfolio might change. The probability measure for next year will then be P̃ₙ, with

P̃ₙ[X ∈ [18; 25] | S = A] = 35% and P̃ₙ[Y > 50% | S = A] = 27%
P̃ₙ[X ∈ [18; 25] | S = B] = 20% and P̃ₙ[Y > 50% | S = B] = 27%,

if our score, used to assess whether we give a loan to some clients, attracts more young (and more risky) people. We do not discuss this issue here, but the "generalization" property should be with respect to a new, unobservable and hard-to-predict probability measure P̃ₙ (and not P as usually considered in machine learning, as discussed in the next sections).
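In practice, Pₙ is just a set of empirical frequencies computed on the portfolio sample Dₙ. A minimal Python sketch (the toy data and the column names "age", "gender", and "score" are assumptions, not the actual GermanCredit variables) could be:

import pandas as pd

# Hypothetical portfolio extract, for illustration only
df = pd.DataFrame({
    "age":    [19, 23, 41, 22, 35, 58, 24, 30],
    "gender": ["A", "A", "A", "B", "B", "B", "A", "B"],
    "score":  [0.62, 0.48, 0.30, 0.55, 0.41, 0.22, 0.51, 0.47],
})

for s in ("A", "B"):
    sub = df[df["gender"] == s]
    p_age = sub["age"].between(18, 25).mean()   # Pn[X in [18; 25] | S = s]
    p_score = (sub["score"] > 0.5).mean()       # Pn[Y > 50% | S = s]
    print(s, round(p_age, 2), round(p_score, 2))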
3.1.2 Models
Predictive models are used to capture a relationship between a response variable y (typically a claim occurrence, a claim frequency, a claim severity, or an annual cost) and a collection of predictors, denoted¹ x, as explained in Schmidt (2006). If y is binary, a classical model will be the logistic regression, for example (see Dierckx 2006), and more generally, actuaries have used intensively generalized linear models (GLMs, see Frees (2006) or Denuit et al. (2019a)), where a prediction of the outcome y is obtained by transforming a linear combination of predictors. Econometric models (and GLMs) are popular as they rely strongly on probabilistic models, and the insurance business is based on the randomness of events.

Fig. 3.1 A simple linear model, a piecewise constant (green) model, or a complex model (nonlinear but continuous), from ten observations (xᵢ, yᵢ), where x is a temperature in degrees Fahrenheit and y is the temperature in degrees Celsius, at the same location i
For Ekeland (1995), modeling is the (intellectual) construction of a mathematical model, i.e., a network of equations supposed to describe reality. Very often, a model is also, above all, a simplification of this reality. A model that is too complex is not a good model. This is the idea of over-learning (or "overfitting") that is found in statistics, or the concept of parsimony, sometimes called "Ockham's razor" (as in Fig. 3.1), which is typical in econometrics and was discussed by William of Ockham (in the fourteenth century). As Milanković (1920) stated, "in order to be able to translate the phenomena of nature into mathematical language, it is always necessary to admit simplifications and to simplify certain influences and irregularities." The model is a simplification of the world, or, as Korzybski (1958) said in a geography context, "a map is not the territory it represents, but, if correct, it has a similar structure to the territory, which accounts for its usefulness." The map is not the territory: the map reflects our representation of the world, whereas the territory is the world as it really is. We naturally think of Borges (1946) (or Umberto Eco's pastiche of the impossibility of constructing the 1:1 map of the Empire, in Eco (1992)),² "en aquel Imperio, el Arte de la Cartografía logró tal Perfección que el mapa de una sola Provincia ocupaba toda una Ciudad, y el mapa del Imperio, toda una Provincia. Con el tiempo, estos Mapas Desmesurados no satisficieron y los Colegios de Cartógrafos levantaron un Mapa del Imperio, que tenía el tamaño del Imperio y coincidía puntualmente con él."

¹ As discussed previously, notation z is also used later on, and we distinguish admissible predictors x, and sensitive ones s. In this chapter, we mainly use x, as in most textbooks.
² "In that Empire, the Art of Cartography achieved such Perfection that the map of a single Province occupied an entire City, and the map of the Empire, an entire Province. In time, these inordinate maps were not satisfactory and the Colleges of Cartographers drew up a map of the Empire, which was the size of the Empire and coincided exactly with it" [personal translation].
The notion of model seems to have been replaced by the term "algorithm," or even "artificial intelligence," or A.I. for short (notably in the press, see Milmo (2021), Swauger (2021) or Smith (2021) among many others). For Zafar et al. (2019), "algorithm" means predictive models (decision rules) calibrated from historical data through data-mining techniques. To understand the difference, Cardon (2019) gives an example to explain what machine learning is. It is quite simple to write a program that converts a temperature in degrees Fahrenheit to a temperature given in degrees Celsius. To do this, there is a simple rule: subtract 32 from the temperature in degrees Fahrenheit x and multiply the result by 5/9 (or divide by 1.8), i.e.,

y ← (5/9) (x − 32).
A machine-learning (or artificial intelligence) approach offers a very different solution. Instead of coding the rule into the machine (what computer scientists might call "Good Old-Fashioned Artificial Intelligence," as in Haugeland (1989)), we simply give the machine several examples of matches between temperatures in Celsius and Fahrenheit, (xᵢ, yᵢ). We enter the data into a training dataset, and the algorithm will learn a conversion rule by itself, looking for the candidate function closest to the data. We can then find an example like the one in Fig. 3.1, with some data and different models (one simple (linear) and one more complex).
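The contrast can be made explicit in a few lines of Python: rather than hard-coding y ← (5/9)(x − 32), we fit a linear model to a handful of (xᵢ, yᵢ) examples and let least squares recover the rule (a minimal sketch with noise-free pairs; real observations would add measurement noise):

import numpy as np

# Ten (Fahrenheit, Celsius) examples; the conversion rule itself is never coded
# into the learner, it is only used here to generate the training pairs.
x = np.array([14., 23., 32., 41., 50., 59., 68., 77., 86., 95.])
y = (x - 32) * 5 / 9

slope, intercept = np.polyfit(x, y, deg=1)   # least-squares fit of a linear model
print(slope, intercept)                      # ~0.5556 (= 5/9) and ~-17.78 (= -160/9)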
It is worth noting that the "complexity" of certain algorithms, or their "opacity" (which leads to the term "black box"), has nothing to do with the optimization algorithm used (in deep learning, back-propagation is simply an iterative mechanism for optimizing a clearly described objective). It is mainly that the model obtained may seem complex, impenetrable, because it takes into account the possible interactions between the predictor variables, for example. For the sake of completeness, a distinction should be made between classical supervised machine-learning algorithms and reinforcement-learning techniques. The latter case describes sequential (or iterative) learning methods, where the algorithm learns by experimenting, as described in Charpentier et al. (2021). We find these algorithms in automatic driving, for example, or if we wanted to correctly model the links between the data, the constructed model, the newly collected data, the update of the model, etc. But we will not dwell further on this class of models here.
To conclude this first section, let us stress that in insurance models, the goal is not to predict "who" will die, or get involved in a car accident. Actuaries create scores that are interpreted as the probability of dying, or the probability of getting a bodily injury claim, in order to compute "fair" premiums. To use a typical statistical example, let y denote the face of a die, potentially loaded. If p is the (true) probability of falling on 6 (say 14.5752%), we say at first that we are able to acquire information about the way the die was made, about its roughness, its imperfections, that will allow us to refine our knowledge of this probability, but also that we have a model able to link the information in an adequate way. Knowing better the probability of falling on 6 does not guarantee that the die will fall on 6; the random component does not disappear, and will never disappear. Translated into the insurance problem, p might denote the "true" probability that a specific policyholder will get involved in a car accident. Based on external information x, some model will predict that the probability of being in an accident is pₓ (say 11.1245%). As mentioned by the French statistician Alfred Sauvy, "dans toute statistique, l'inexactitude du nombre est compensée par la précision des décimales" (or "in all statistics, the inaccuracy of the number is compensated for by the precision of the decimals," infinite precision we might add). The goal is not to find a model that returns either 0% or 100% (this happens with several machine-learning algorithms), but simply to assess with confidence a valid probability, used to compute a "fair" premium. And an easy way is perhaps to use simple categories: the probability of getting a 6 is less than 15% (less than with a fair die), between 15% and 18.5% (close to a fair die), or more than 18.5% (more than with a fair die).
3.2 From Categorical to Continuous Models
"Risk classification" is an old and natural way to set insurance premiums, as explained in Sect. 2.3.3. Not only are higher-rated insured persons less likely to engage in the risky activity, but risk classification also provides incentives for risk reduction (merit rating in auto insurance encourages adherence to traffic regulations; experience rating in workers' compensation encourages employers to eliminate workplace hazards, etc.). And as suggested by Feldblum (2006), risk classification also promotes social welfare by making insurance coverage more widely available.
3.2.1 Historical Perspective, Insurers as Clubs
In ancient Rome, a collegium (plural collegia) was an association. They functioned as social clubs, or religious collectives, whose members worked toward their shared interests, as explained in Verboven (2011). During Republican Rome (which began with the overthrow of the Roman Kingdom, 509 before the common era, and ended with the establishment of the Roman Empire, 27 before our era), military collegia were created. As explained in Ginsburg (1940), upon the completion of his service a veteran had the right to join one of the many collegia veteranorum in each legion. The Government established special savings banks. Half of the cash bonuses, donativa, which the emperors attributed to the soldiers on various occasions, was not handed over to the beneficiaries in cash but was deposited to the account of each soldier in his legion's savings bank. This could be seen as an insurance scheme, and the risks against which a member was insured were diverse. In the case of retirement, upon the completion of his term of service, the soldier would receive a lump sum that helped him to somewhat arrange the rest of his life. The membership in a collegium gave him a mutual insurance against "unforeseen risks." These collegia, besides being cooperative insurance companies, had other functions. And because of the structure of those collegia, based on corporatism, members were quite homogeneous.
Sometime in the early 1660s, the Pirate’s Code was supposedly written by the Portuguese buccaneer Bartolomeu Português. And interestingly, a section is explicitly dedicated to insurance and benefits: “a standard compensation is provided for maimed and mutilated buccaneers. Thus they order for the loss of a right arm six hundred pieces of eight, or six slaves; for the loss of a left arm five hundred pieces of eight, or five slaves; for a right leg five hundred pieces of eight, or five slaves; for the left leg four hundred pieces of eight, or four slaves; for an eye one hundred pieces of eight, or one slave; for a finger of the hand the same reward as for the eye,” see Barbour (1911) (or more recently Leeson (2009) and Fox (2013) about this piratical scheme).
In the nineteenth century, in Europe, mutual aid societies involved a group of individuals who made regular payments into a common fund in order to provide for themselves in later, unforeseeable moments of financial hardship or of old age. As mentioned by Garrioch (2011), in 1848, there were 280 mutual aid societies in Paris with well over 20,000 members. For example, the Société des Arts Graphiques was created in 1808. It admitted only men over 20 and under 50, and it charged much higher admission and annual fees for those who joined at a more advanced age. In return, they received benefits if they were unable to work, reducing over a period of time, but in the case of serious illness the Society would pay the admission fee for a hospice. In England, there were "friendly societies," as described in Ismay (2018). In France, after the 1848 revolution and Louis Napoléon Bonaparte's coup d'état in 1851, mutual funds were seen as a means of harmonizing social classes. The money collected through contributions came to the rescue of unfortunate workers, who would no longer have any reason to radicalize. It was proposed that insurance should become compulsory (Bismarck proposed this in Germany in 1883), but the idea was rejected in favor of giving workers the freedom to contribute, as the only way to moralize the working classes, as Da Silva (2023) explains. In 1852, of the 236 mutual funds created, 21 were on a professional basis, whereas the other 215 were on a territorial basis. And from 1870 onward, mutual funds diversified the professional profile of contributors beyond blue-collar workers, and expanded to include employees, civil servants, the self-employed, and artists. But the amount of the premium was not linked to the risk. As Da Silva (2023) puts it, "mutual insurers see in the actuarial figure the programmed end of solidarity." For mutual funds, solidarity is essential, with everyone contributing according to their means and receiving according to their needs. At around the same time, in France, the first insurance companies appeared, based on risk selection, and the first mathematical approaches to calculating premiums. Hubbard (1852) advocated the introduction of an "English-style scientific organization" in their management. For its members, they had to be able to know "the probable average of the claims" that they should cover, like insurance companies. The development of tables should lead insurers to adopt the principle of contributions varying according to the age of entry and the specialization of contributions and funds (health/retirement). It is with this in mind that they drew up tables. For Stone (1993) and Gowri (2014), the defining feature of "modern insurance" is its reliance on segmenting the risk pool into distinct categories, each receiving a price corresponding to the particular risk that the individuals assigned to that category are expected to represent (as accurately as can be estimated by actuaries).
3.2.2 “Modern Insurance” and Categorization
Once heterogeneity with respect to the risk was observed in portfolios, insurers operated by categorizing individuals into risk classes and assigning corresponding tariffs. This ongoing process of categorization ensures that the sums collected, on average, are sufficient to address the realized risks within specific groups. The aim of risk classification, as explained in Wortham (1986), is to identify the specific characteristics that are supposed to determine an individual's propensity to suffer an adverse event, forming groups within which the risk is (approximately) equally shared. The problem, of course, is that the characteristics associated with various types of risk are almost infinite; as they cannot all be identified and priced in every risk classification system, there will necessarily be unpriced sources of heterogeneity between individuals in a given risk class.
In 1915, as mentioned in Rothstein (2003), the president of the Association of Life Insurance Medical Directors of America noted that the question asked almost universally of the Medical Examiner was "What is your opinion of the risk? Good, bad, first-class, second-class, or not acceptable?" Historically, insurance prices were a (finite) collection of prices (maybe more than the two classes mentioned, "first-class" and "second-class"). In Box 3.1, from the early 1920s, Albert Henry Mowbray, who worked for the New York Life Insurance Company and later Liberty Mutual (and was also an actuary for state-level insurance commissions in North Carolina and California, and for the National Council on Workmen's Insurance), gives his perspective on insurance rate making.
Box 3.1 Historical perspective, Albert Henry Mowbray (1921)
"Classification of risks in some manner forms the basis of rate making in practically all branches of insurance. It would appear therefore that there should be some fundamental principle to which a correct system of classification in any branch of insurance should conform (...) As long ago as the days of ancient Greece and Rome the gradual transition of natural phenomena was observed and set down in the Latin maxim, 'natura non agit per saltum'. If each risk, therefore, is to be precisely rated, it would be necessary to recognize very minute differences and precisely measure them. (...) Since we are not capable of covering a large field fully and at the same time recognizing small differences in all parts of the field, it is natural that we resort to subdivision of the field by means of classification, thereby concentrating our attention on a smaller interval which may again be subdivided by further classification, and the system so carried on to the limit to which we find it necessary or desirable to go. But however far we may go in any system of classification, whether in the field of pure or applied science including the business of insurance, we shall always find difficulties presented by the borderline case, difficulties which arise from the continuous character of natural phenomena which we are attempting to place in more or less arbitrary divisions. While thus acknowledging that classification will never completely solve the problem of recognizing differences between individuals, nevertheless classification seems to be necessary at least as a preliminary step toward such recognition in any field of study. The fact that a complete and final solution cannot be made is, therefore, no justification for completely discarding classification as a method of approach. Since it is insurance hazards that we undertake to measure and classify, the preliminary step in studying classification theory may well be to ask what is an insurance hazard and how it may be determined. It must be evident to the members of this Society that an insurance hazard is what is termed "a mathematical expectation," that is a product of a sum at risk and the probability of loss from the conditions insured against, e.g., the destruction of a piece of property by fire, the death of an individual, etc. If the net premiums collected are so determined on the basis of the true natural probability and there is a sufficient spread then the sums collected will just cover the losses and this is what should be."
"1. The classification should bring together risks which have inherent in their operation the same causes of loss.
2. The variation from risk to risk in the strength of each cause, or at least of the more important, should not be greater than can be handled by the formula by which the classification is subdivided, i.e., the Schedule and/or Experience Rating Plan used.
3. The classification should not cover risks which include, as important elements of their hazard, causes which are not common to all.
4. The classification system and the formula for its extension (Schedule and/or Experience Rating Plans) should be harmonious.
5. The basis throughout should be the outward, recognizable indicia of the presence and potency of the several inherent causes of loss, including extent as well as occurrence of loss."
Several articles and textbooks in sociology have tried to understand how classification mechanisms establish symbolic boundaries that reinforce group identities, such as Bourdieu (2018), Lamont and Molnár (2002), Massey (2007), Ridgeway (2011), Fourcade and Healy (2013), or Brubaker (2015). But here, those "groups" or "classes" do not share any identity, and Simon (1988) or Harcourt (2015b) use the term "actuarial classification" (where "actuarial" designates any decision-making technique that relies on predictive statistical methods, replacing more holistic or subjective forms of judgment). In those class-based systems, based on insurance rating tables (or grids), results are determined by assigning individuals to a group in which each person is positioned as "average" or "typical". [Most] "actuaries cannot think of individuals except as members of groups," claimed Brilmayer et al. (1979). Each individual is allocated the same value as all other members of the group to which it is assigned (as opposed to models discussed in Sect. 3.3, where a model gives each individual its own unique value or score, as close as possible, as explained in Fourcade (2016)). Simon (1987, 1988), and then Feeley and Simon (1992), defined "actuarialism," which designates the use of statistics to guide "class-based decision-making," used to price pensions and insurance. As explained in Harcourt (2015b), this "actuarial classification" is the constitution of groups with no experienced social significance for the participants. A person classified as a particular risk by an insurance company shares nothing with the other people thus classified, apart from a series of formal characteristics (e.g., age, sex, marital status). As we see in Sect. 4.1 on interpretability, actuaries try ex post to give social representations to those groups. For Austin (1983) and Simon (1988), categories used by the insurance company when grouping risks are "singularly sterile," resulting in inert, immobile, and deactivated communities, corresponding to "artificial" groups. These are not groups organized around a shared history, common experiences, or active commitment; they form some "aggregates," living only in the imagination of the actuary who calculates and tabulates, not in any lived form of human association. While Hacking (1990) observed that standard classes create coherent group identities (causing possible stereotypes and discrimination, as we discuss in Part III), Simon (1988) provocatively suggests that actuarial classifications can in turn "undo people's identity." As mentioned in Abraham (1986), the goal for actuaries is to create groups, or "classes," made up of individuals who share a series of common characteristics and are therefore presumed to represent the same risk. Following François (2022), we could claim that actuarial techniques reduce individuals to a series of formal roles that have no "moral density" and therefore do not grant an "identity" that organizes a coherent sense of self. And the inclusion of nominally "demoralized categories," such as gender, in class-based rating systems makes their total demoralization difficult to achieve, and is in itself an issue of struggle.
Heimer (1985) used the term “community of fate.” These “communities” created artificially by statisticians are, in that sense, very different from the communities of workers, neighbors, and co-religionists that characterized the traditional mutual organizations displaced by modern forms of insurance, as explained in Gosden (1961), Clark and Clark (1999), Levy (2012), or Zelizer (2017). Furthermore, Rouvroy et al. (2013) and Cheney-Lippold (2017) point out
that scoring technologies are continually swapping predictors, "shuffling the cards," so that there is no stable basis for constructing group memberships, or a coherent sense of self.
Harry S. Havens in the late 1970s gave the description mentioned in Box 3.2.
Box 3.2 Historical perspective, Harry S. Havens (1979)
"The price which a person pays for automobile insurance depends on age, sex, marital status, place of residence and other factors. This risk classification system produces widely differing prices for the same coverage for different people. Questions have been raised about the fairness of this system, and especially about its reliability as a predictor of risk for a particular individual. While we have not tried to judge the propriety of these groupings, and the resulting price differences, we believe that the questions about them warrant careful consideration by the State insurance departments. In most States the authority to examine classification plans is based on the requirement that insurance rates are neither inadequate, excessive, nor unfairly discriminatory. The only criterion for approving classifications in most States is that the classifications be statistically justified, that is, that they reasonably reflect loss experience. Relative rates with respect to age, sex, and marital status are based on the analysis of national data. A youthful male driver, for example, is charged twice as much as an older driver all over the country (...) It has also been claimed that insurance companies engage in redlining, the arbitrary denial of insurance to everyone living in a particular neighborhood. Community groups and others have complained that State regulators have not been diligent in preventing redlining and other forms of improper discrimination that make insurance unavailable in certain areas. In addition to outright refusals to insure, geographic discrimination can include such practices as: selective placement of agents to reduce business in some areas, terminating agents and not renewing their book of business, pricing insurance at unaffordable levels, and instructing agents to avoid certain areas. We reviewed what the State insurance departments were doing in response to these problems. To determine if redlining exists, it is necessary to collect data on a geographic basis. Such data should include current insurance policies, new policies being written, cancellations, and non-renewals. It is also important to examine data on losses by neighborhoods within existing rating territories because marked discrepancies within territories would cast doubt on the validity of territorial boundaries. Yet, not even a fifth of the States collect anything other than loss data, and that data is gathered on a territory-wide basis."
In Box 3.3, a paragraph from Casey et al. (1976), by Barbara Casey, Jacques Pezier, and Carl Spetzler, provides some historical perspective.
Box 3.3 Historical perspective, Casey et al. (1976)
"On the other hand, the opinion that distinctions based on sex, or any other group variable, necessarily violate individual rights reflects ignorance of the basic rules of logical inference in that it would arbitrarily forbid the use of relevant information. It would be equally fallacious to reject a classification system based on socially acceptable variables because the results appear discriminatory. For example, a classification system may be built on the use of a car, mileage, merit rating, and other variables, excluding sex. However, when verifying the average rates according to sex one may discover significant differences between males and females. Refusing to allow such differences would be attempting to distort reality by choosing to be selectively blind. The use of rating territories is a case in point. Geographical divisions, however designed, often correlate with socio-demographic factors such as income level and race because of natural aggregation or forced segregation according to these factors. Again, we conclude that insurance companies should be free to delineate territories and assess territorial differences as well as they can. At the same time, insurance companies should recognize that it is in their best interest to be objective and use clearly relevant factors to define territories lest they be accused of invidious discrimination by the public. (...) One possible standard does exist for exception to the counsel that particular rating variables should not be proscribed. What we have called 'equal treatment' standard of fairness may precipitate a societal decision that the process of differentiating among individuals on the basis of certain variables is discriminatory and intolerable. This type of decision should be made on a specific, statutory basis. Once taken, it must be adhered to in private and public transactions alike and enforced by the insurance regulator. This is, in effect, a standard for conduct that by design transcends and preempts economic considerations. Because it is not applied without economic cost, however, insurance regulators and the industry should participate in and inform legislative deliberations that would ban the use of particular rating variables as discriminatory."
3.2.3 Mathematics of Rating Classes
As mentioned in Sect. 2.5, an important theorem when modeling heterogeneity is the variance decomposition property, or "law of total variance" (corresponding to the Pythagorean Theorem, see Proposition 2.3),

Var[Y] = E[Var[Y | Θ]] + Var[E[Y | Θ]],

where the first term is the within variance and the second the between variance. Here, the variance of the outcome Y is decomposed into two parts, one representing the variance due to the variability of the underlying risk factor Θ, and one reflecting the inherent variability of Y if Θ did not vary (the homogeneous case).
Here, the variance of the outcome Y is decomposed into two parts, one representing the variance due to the variability of the underlying risk factor . , and one reflecting the inherent variability of Y if . did not vary (the homogeneous case). One can recognize that a similar idea is the basis for analysis of variance (ANOVA) models (as formalized in Fisher (1921) and Fisher and Mackenzie (1923)) where the total variability is split into the “within groups” and the “between groups.” The “one-way ANOVA” is a technique that can be used to compare whether or not the means of two (or more) samples are significantly different. If the outcome y is continuous (extensions can be obtained for binary variables, or counts), suppose that
$$y_{i,j} = \mu_j + \varepsilon_{i,j},$$
where $i$ is the index over individuals, and $j$ the index over groups (with $j = 1, 2, \cdots, J$). $\mu_j$ is the mean of the observations for group $j$, and the errors $\varepsilon_{i,j}$ are supposed to be zero-mean (and normally distributed, as a classical assumption). One could also write
$$y_{i,j} = \mu + \alpha_j + \varepsilon_{i,j}, \quad\text{where } \alpha_1 + \alpha_2 + \cdots + \alpha_J = 0,$$
where $\mu$ is the overall mean, whereas $\alpha_j$ is the deviation from the overall mean, for group $j$. Of course, one can generalize that model to multiple factors. In the "two-way ANOVA," two types of groups are considered,
$$y_{i,j,k} = \mu_{j,k} + \varepsilon_{i,j,k},$$
where $j$ is the index over groups according to the first factor, whereas $k$ is the index over groups according to the second factor. $\mu_{j,k}$ is the mean of the observations for groups $j$ and $k$, and the errors $\varepsilon_{i,j,k}$ are supposed to be zero-mean. We can write the mean as a linear combination of factors, in the sense that
$$y_{i,j,k} = \underbrace{\mu + \alpha_j + \beta_k + \gamma_{j,k}}_{=\,\mu_{j,k}} + \,\varepsilon_{i,j,k},$$
where $\mu$ is still the overall mean, whereas $\alpha_j$ and $\beta_k$ correspond to the deviations from the overall mean, and $\gamma_{j,k}$ is the non-additive interaction effect. In order to have identifiability of the model, some "sum-to-zero" constraints are added, as previously,
$$\sum_{j=1}^{J} \alpha_j = \sum_{j=1}^{J} \gamma_{j,k} = 0 \quad\text{and}\quad \sum_{k=1}^{K} \beta_k = \sum_{k=1}^{K} \gamma_{j,k} = 0.$$
A more modern way to consider those models is to use linear models. For example, for the "one-way ANOVA," we can write $y = X\beta + \varepsilon$, where
$$y = (y_{1,1}, \cdots, y_{n_1,1},\, y_{1,2}, \cdots, y_{n_2,2},\, \cdots,\, y_{1,J}, \cdots, y_{n_J,J})^\top,$$
$$\varepsilon = (\varepsilon_{1,1}, \cdots, \varepsilon_{n_1,1},\, \varepsilon_{1,2}, \cdots, \varepsilon_{n_2,2},\, \cdots,\, \varepsilon_{1,J}, \cdots, \varepsilon_{n_J,J})^\top,$$
$$\beta = (\beta_0, \beta_1, \cdots, \beta_J)^\top, \quad\text{and}\quad X = [\mathbf{1}_n, A],$$
where
$$A = \begin{pmatrix} \mathbf{1}_{n_1} & 0 & \cdots & 0\\ 0 & \mathbf{1}_{n_2} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \mathbf{1}_{n_J} \end{pmatrix} \quad\text{and}\quad X = \begin{pmatrix} \mathbf{1}_{n_1} & \mathbf{1}_{n_1} & 0 & \cdots & 0\\ \mathbf{1}_{n_2} & 0 & \mathbf{1}_{n_2} & \cdots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ \mathbf{1}_{n_J} & 0 & 0 & \cdots & \mathbf{1}_{n_J} \end{pmatrix}$$
are respectively $n \times J$ and $n \times (J+1)$ matrices. In the first approach, $y = A\beta + \varepsilon$, and the ordinary least squares estimate is
$$\widehat{\beta} = (A^\top A)^{-1} A^\top y = (\overline{y}_{\cdot 1}, \cdots, \overline{y}_{\cdot J})^\top \in \mathbb{R}^J, \quad\text{where}\quad \overline{y}_{\cdot j} = \frac{1}{n_j} \sum_{i=1}^{n_j} y_{i,j},$$
so that $\widehat{\mu}_j = \overline{y}_{\cdot j}$ is simply the average within group $j$. In the second case, if $y_{i,j} = \mu + \alpha_j + \varepsilon_{i,j}$, where $\alpha_1 + \alpha_2 + \cdots + \alpha_J = 0$, we can prove that
$$\widehat{\mu} = \overline{y} = \frac{1}{J} \sum_{j=1}^{J} \overline{y}_{\cdot j} \quad\text{and}\quad \widehat{\alpha}_j = \overline{y}_{\cdot j} - \overline{y},$$
where the estimator of $\mu$ is the average of the group averages. An alternative is to change the constraint slightly, so that $n_1 \alpha_1 + n_2 \alpha_2 + \cdots + n_J \alpha_J = 0$, and in that case
$$\widehat{\mu} = \overline{y} = \frac{1}{n} \sum_{j=1}^{J} n_j\, \overline{y}_{\cdot j} \quad\text{and}\quad \widehat{\alpha}_j = \overline{y}_{\cdot j} - \overline{y}.$$
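As a minimal sketch (on simulated data, with arbitrary group sizes and means), the two parametrizations can be compared in R: with the default treatment contrasts the intercept is the first group mean, whereas with sum-to-zero contrasts (contr.sum) the intercept is the unweighted average of the group averages, as above.

```r
## Minimal sketch: the one-way ANOVA seen as a linear model (made-up data)
set.seed(123)
g <- factor(rep(c("A", "B", "C"), times = c(30, 50, 20)))
y <- rnorm(100, mean = c(1, 2, 4)[as.integer(g)])
tapply(y, g, mean)                 # hat(mu)_j: the averages within groups
coef(lm(y ~ g))                    # treatment contrasts: intercept = mean of group A
coef(lm(y ~ g, contrasts = list(g = contr.sum)))  # sum-to-zero: intercept = mean of group means
```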
Let $j \in \{1, 2, \cdots, J\}$ and $k \in \{1, 2, \cdots, K\}$, and let $n_{jk}$ denote the number of observations in group $j$ for the first factor, and $k$ for the second. Define the averages
$$\overline{y}_{\cdot jk} = \frac{1}{n_{jk}} \sum_{i=1}^{n_{jk}} y_{ijk}, \qquad \overline{y}_{\cdot j \cdot} = \frac{1}{n_{j \cdot}} \sum_{k=1}^{K} \sum_{i=1}^{n_{jk}} y_{ijk}, \qquad\text{and}\qquad \overline{y}_{\cdot\cdot k} = \frac{1}{n_{\cdot k}} \sum_{j=1}^{J} \sum_{i=1}^{n_{jk}} y_{ijk}.$$
The model is here
$$y_{ijk} = \mu + \alpha_j + \beta_k + \gamma_{jk} + \varepsilon_{ijk},$$
which we can write, using $J + K + JK$ indicators (vectors in dimension $n$, with respectively, in each block, $n_{j\cdot}$, $n_{\cdot k}$ and $n_{jk}$ ones), as a classical regression problem. As previously, under identifiability assumptions, it is possible to obtain interpretable estimates of those quantities,
$$\widehat{\mu} = \overline{y}, \qquad \widehat{\alpha}_j = \overline{y}_{\cdot j \cdot} - \overline{y}, \qquad \widehat{\beta}_k = \overline{y}_{\cdot\cdot k} - \overline{y}, \qquad \widehat{\gamma}_{jk} = \overline{y}_{\cdot jk} - \overline{y}_{\cdot j \cdot} - \overline{y}_{\cdot\cdot k} + \overline{y}.$$
Without the non-additive interaction effect, the model becomes
$$y_{ijk} = \mu + \alpha_j + \beta_k + \varepsilon_{ijk}.$$
Such models were used historically in claims reserving (see Kremer (1982) for a formal connection), and, of course, in ratemaking. As explained in Bennett (1978), "in a rating structure used in motor insurance there may typically be about eight factors, each having a number of different levels into which risks may be classified and then be charged different rates of premium," with either an "additive model" or a "multiplicative model" for the premium $\mu$ (with the notations of Bennett (1978)),
$$\begin{cases} \mu_{jk\cdots} = \mu + \alpha_j + \beta_k + \cdots & \text{(additive)},\\ \mu_{jk\cdots} = \mu \cdot \alpha_j \cdot \beta_k \cdots & \text{(multiplicative)}, \end{cases}$$
where $\alpha_j$ is a parameter value for the $j$-th level of the first risk factor, etc., and $\mu$ is a constant corresponding to some "overall average level."
Historically, classification relativities were determined one dimension at a time (see Feldblum and Brosius (2003), and the appendices to McClenahan (2006) and Finger (2006) for some illustration of the procedure). Then, Bailey and Simon (1959, 1960) introduced the “minimum bias procedure.”
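As an illustration, here is a minimal sketch (not the historical computations, and on made-up data) of an iterative minimum-bias fit of the multiplicative model, where the overall level $\mu$ is absorbed into the factors; its fixed point is known to match a Poisson GLM with log link.

```r
## Minimal sketch of a Bailey-type iterative "minimum bias" fit of the
## multiplicative model mu_{jk} = alpha_j * beta_k (all numbers made up):
## r[j,k] = observed average loss in cell (j,k), w[j,k] = its exposure
r <- matrix(c(100, 120, 200,
              150, 180, 310), nrow = 2, byrow = TRUE)
w <- matrix(c( 50,  40,  30,
               45,  35,  25), nrow = 2, byrow = TRUE)
alpha <- rep(1, nrow(r)); beta <- colMeans(r)
for (it in 1:50) {  # iterate the "balance" equations until convergence
  alpha <- rowSums(w * r) / as.vector(w %*% beta)
  beta  <- colSums(w * r) / as.vector(t(w) %*% alpha)
}
outer(alpha, beta)  # fitted multiplicative premiums mu_{jk}
```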
In Fig. 3.2, we can visualize a dozen classes associated with credit risk (on the GermanCredit database), with the predictions given by two models on the x-axis, and the empirical default probability on the y-axis (this corresponds to a discrete version of the calibration plot described in Sect. 4.3.3).
As discussed in Agresti (2012, 2015), there are strong connections between those approaches based on groups and linear models, and actuarial research started to move toward "continuous" models. Nevertheless, the use of categories has been popular in the industry for several decades. For example, Siddiqi (2012) recommends cutting all continuous variables into bins, using a so-called "weight-of-evidence binning" technique, usually seen as an "optimal binning" for numerical and categorical variables, using methods including tree-like segmentation or Chi-squared merging. In R, it can be performed using the woebin function of the scorecard package (see the sketch after Fig. 3.3). For example, on the GermanCredit dataset, three continuous variables are divided into bins, as in Fig. 3.3.
Fig. 3.2 Scatterplot with predictions $\widehat{y}$ on various groups, and average outcomes $\overline{y}$, on the database GermanCredit, with the logistic regression (GLM) on the left and the random forest (RF) on the right. Here, $y$ corresponds to the indicator of having a bad risk. Sizes of circles are proportional to sizes of groups

Fig. 3.3 From continuous variables to categories (five categories $\{A, B, C, D, E\}$), for three continuous variables of the GermanCredit dataset: duration of the credit, amount of credit, and age of the applicant. Bars in the background are the number of applicants in each bin (y-axis on the left), and the line is the probability of having a default (y-axis on the right)
For the duration (in months), bins are A = $[0, 8)$, B = $[8, 16)$, C = $[16, 34)$, D = $[34, 44)$, and E = $[44, 80)$; for the credit amount, bins are A = $[0, 1400)$, B = $[1400, 1800)$, C = $[1800, 4000)$, D = $[4000, 9200)$, and E = $[9200, 20000)$; and for the age of the applicant, A = $[19, 26)$, B = $[26, 28)$, C = $[28, 35)$, D = $[35, 37)$, and E = $[37, 80)$. The use of categorical features to create ratemaking classes is now becoming obsolete, as more and more actuaries consider "individual" pricing models.
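A minimal sketch of that binning, using the scorecard package (which ships a version of the germancredit data; the column names below are those of that version, and are an assumption here):

```r
## Minimal sketch of weight-of-evidence binning with scorecard::woebin
library(scorecard)
data("germancredit")
bins <- woebin(germancredit, y = "creditability",
               x = c("duration.in.month", "credit.amount", "age.in.years"))
bins$duration.in.month  # breaks, counts, and weight-of-evidence per bin
```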
3.2.4 From Classes to Score
Instead of considering risk classes, the measurement of risk can take a very different form, which we could call "individualization," or "personalization," as in Barry and Charpentier (2020). In many respects, the latter is a kind of asymptotic limit of the former, when the number of classes increases: by assigning individuals to ever finer exclusive categories, until each category consists of a single individual, the processes of "categorization" and "individualization" converge. Besides computational aspects (discussed in the next section), this approach fundamentally alters the logical properties of reasoning, as discussed in François (2022) and Krippner (2023). When individualized measures are employed, they are situated on a continuous scale: individuals are assigned highly precise "scores," which may occasionally be shared with others, and which enable the ranking of individuals in relation to one another. These scores are no longer discrete, discontinuous, qualitative categories, but numerical values that can, therefore, be subjected to calculations (as explained in the previous chapter). Furthermore, these numbers possess cardinal value, in the sense that they not only facilitate the ranking of risks in comparison with one another but also denote a quantity (of risk) amenable to reasoning and notably computation. Last, probabilities can be associated with these numbers, which are not the property of a group but that of an individual: risk measurement is no longer intended to designate the probability of an event occurring within a group once in a thousand trials; it is aimed at providing a quantified assessment of the default risk associated with a specific individual, in their idiosyncrasy and irreducible singularity. Risk measurement has thus evolved into an individualized measure, as François (2022) claims. Thanks to those scores, individual policyholders are directly comparable. As Friedler et al. (2016) explained, "the notion of the group ceases to be a stable analytical category and becomes a speculative ensemble assembled to inform a decision and enable a course of action (...) Ordered for a different purpose, the groups scatter and reassemble differently." In the next section, we present techniques used by actuaries to model risks and compute premiums.
3.3 Supervised Models and “Individual” Pricing
While we mainly discuss insurance pricing models here, i.e., supervised models where the variable of interest $y$ is the occurrence of a claim in the coming year, the number of claims, or the total charge, it is worth keeping in mind that the input data ($x$) can themselves be the predictions of a model. For example, $x_1$ could be
an observed acceleration score from the previous year (computed by an external provider who had access to the raw telematics data), $x_2$ could be the distance to the nearest fire station (extrapolated from distance-to-address software), $x_3$ could be the age of the policyholder, $x_4$ could be a prediction of the number of kilometers driven, etc. (in Chap. 5, we discuss more predictive variables used by actuaries). In the training dataset, the "past observations" $y_i$ can also be predictions, especially if we want to use recent claims, still pending, but where the claims manager can give an estimate, based on human experts but also possibly on opaque algorithms. We can think of those applications that give a cost estimate of a car damage claim based on a photo of the vehicle sent by the policyholder, or of the use of compensation scales for claims not yet settled.
As we see in this section, a natural "model" or "predictor" for a variable $y$ is related to the conditional expected value. If $y$ corresponds to the total annual loss associated with a given policy, we have seen in the previous chapter that $E[Y]$ was the "homogeneous pure premium" (see Definition 2.4) and $E[Y \mid X]$ corresponds to the "heterogeneous pure premium" (see Definition 2.7). In the classical collective risk model, $Y = Z_1 + \cdots + Z_N$ is a compound sum, a random sum of random individual losses, and under standard assumptions (see Denuit and Charpentier (2004) or Denuit et al. (2007)), $E[Y \mid X] = E[N \mid X] \cdot E[Z \mid X]$, where the first term $E[N \mid X]$ is the expected annual claim frequency for a policyholder with characteristics $X$, whereas $E[Z \mid X]$ is the average cost of a single claim. In this chapter, quite generally, $y$ is one of those variables of interest used to calculate a premium.
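A quick simulation (a sketch, with arbitrary Poisson frequency and gamma severities) illustrates this frequency-severity decomposition of the pure premium:

```r
## Minimal check that E[Y] = E[N] x E[Z] for Y = Z_1 + ... + Z_N,
## with N ~ Poisson(0.1) and i.i.d. Z ~ Gamma(shape = 2, scale = 500),
## N independent of the Z's (parameters are made up)
set.seed(1)
N <- rpois(1e5, lambda = .1)
Y <- sapply(N, function(n) sum(rgamma(n, shape = 2, scale = 500)))
c(mean(Y), mean(N) * 2 * 500)  # both should be close to 100
```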
If $x$ and $y$ are discrete variables, recall that
$$E[Y \mid X = x] = \sum_{y} y\, \mathbb{P}[Y = y \mid X = x].$$
Quite naturally, in the absolutely continuous case, we would write
$$E[Y \mid X = x] = \int y\, f(y \mid x)\, dy = \int y\, \frac{f(x, y)}{f(x)}\, dy,$$
with standard notation. Those functions are interesting, as we have the following decomposition
$$Y = E[Y \mid X = x] + \underbrace{\big(Y - E[Y \mid X = x]\big)}_{=\,\varepsilon},$$
where $E[\varepsilon \mid X = x] = 0$. It should be stressed that the extension to the case where $X$ is absolutely continuous is formally slightly complicated, since $\{X = x\}$ is an event with probability 0, and then $\mathbb{P}[Y \in A \mid X = x]$ is not properly defined (in Bayes
formula, in Proposition 3.2). As stated in Kolmogorov (1933),3 "der Begriff der bedingten Wahrscheinlichkeit in Bezug auf eine isoliert gegebene Hypothese, deren Wahrscheinlichkeit gleich Null ist, unzulässig." Billingsley (2008), Rosenthal (2006) or Resnick (2019) provide theoretical foundations for the notation "$E[Y \mid X = x]$."
Definition 3.1 (Regression Function $\mu$) Let $Y$ be the non-negative random variable of interest, observed with covariates $X$; the regression function is $\mu(x) = E[Y \mid X = x]$.
Without going into too much detail (based on measure theory), we will invoke here the "law of the unconscious statistician" (as coined in Ross (1972) and Casella and Berger (1990)), and write
$$E[\varphi(Y)] = \int_{\Omega} \varphi(Y(\omega))\, \mathbb{P}(d\omega) = \int_{\mathbb{R}} \varphi(y)\, \mathbb{P}_Y(dy),$$
for some random variable $Y : (\Omega, \mathcal{F}, \mathbb{P}) \to \mathbb{R}$, with law $\mathbb{P}_Y$. And we will take even more freedom when conditioning. As discussed in Proschan and Presnell (1998), "statisticians make liberal use of conditioning arguments to shorten what would otherwise be long proofs," and we do the same here. Heuristically (the proof can be found in Pfanzagl (1979) and Proschan and Presnell (1998)), a version of $\mathbb{P}(Y \in A \mid X = x)$ can be obtained as a limit of conditional probabilities given that $X$ lies in small neighborhoods of $x$, the limit being taken as the size of the neighborhood tends to 0,
$$\mathbb{P}\big(Y \in A \mid X = x\big) = \lim_{\epsilon \to 0} \frac{\mathbb{P}\big(\{Y \in A\} \cap \{|X - x| \le \epsilon\}\big)}{\mathbb{P}\big(\{|X - x| \le \epsilon\}\big)} = \lim_{\epsilon \to 0} \mathbb{P}\big(Y \in A \,\big|\, |X - x| \le \epsilon\big),$$
that can be extended into a higher dimension, using some distance between $X$ and $x$, and then use that approach to define4 "$E[Y \mid X = x]$." In Sect. 4.1, we have a brief
discussion about a related problem, which is the distinction between $E[\varphi(x_1, X_2)]$ and $E[\varphi(X_1, X_2) \mid X_1 = x_1]$.
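This neighborhood construction is exactly what local averaging does in practice; a minimal sketch on simulated data (the sine regression function is arbitrary):

```r
## Minimal sketch: E[Y | X = x0] approximated by averaging the y_i's
## whose x_i fall in a small neighborhood of x0
set.seed(1)
x <- runif(1000)
y <- sin(2 * pi * x) + rnorm(1000, sd = .3)
m_hat <- function(x0, eps = .05) mean(y[abs(x - x0) <= eps])
m_hat(.25)  # should be close to sin(pi / 2) = 1
## the kernel (Nadaraya-Watson) estimator smooths the same idea
ksmooth(x, y, kernel = "normal", bandwidth = .1, x.points = .25)$y
```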
3.3.1 Machine-Learning Terminology
Suppose that the random variables $(X, Y)$ are defined on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, and that we observe a finite sample $(x_1, y_1), \cdots, (x_n, y_n)$. Based on that
3 “the notion of conditional probability is inadmissible in relation to a hypothesis given in isolation whose probability is zero” [personal translation]. 4 Which corresponds to a very standard idea in non-parametric statistics, see Tukey (1961), Nadaraya (1964) or Watson (1964).
sample, we want to estimate, or learn, a model $\widehat{m}$ that is a good approximation of the unobservable regression function $\mu$, where $\mu(x) = E[Y \mid X = x]$.
In the specific case where $y$ is a categorical variable, for example a binary variable (taking here values in $\{0, 1\}$), there is strong interest in the machine-learning literature not to estimate the regression function $\mu$, but to construct a "classifier" that predicts the class. For example, in the logistic regression (see Sect. 3.3.2), we suppose that $(Y \mid X = x) \sim \mathcal{B}(\mu(x))$, where $\operatorname{logit}[\mu(x)] = x^\top \beta$, and $\mu(x)$ has two interpretations, as $\mu(x) = E[Y \mid X = x]$ and $\mu(x) = \mathbb{P}[Y = 1 \mid X = x]$. From this regression function, one can easily construct a "classifier" by considering $\widehat{m}_t(x) = \mathbf{1}(\widehat{m}(x) > t)$, taking values in $\{0, 1\}$ (like $y$), for some appropriate cutoff threshold $t \in [0, 1]$.
Definition 3.2 (Loss $\ell$) A loss function $\ell$ is a function defined on $\mathcal{Y} \times \mathcal{Y}$ such that $\ell(y, y') \ge 0$ and $\ell(y, y) = 0$.
A loss is not necessarily a distance (between $y$ and $y'$), as symmetry is not required, and nor is the triangle inequality. Some losses are simply a function (called "cost") of some distance between $y$ and $y'$.
Definition 3.3 (Risk $\mathcal{R}$) For a fitted model $\widehat{m}$, its risk is
$$\mathcal{R}(\widehat{m}) = E\big[\ell(Y, \widehat{m}(X))\big].$$
For instance, in a regression problem, the quadratic loss function $\ell_2$ is used,
$$\ell_2(y, \widehat{y}) = (y - \widehat{y})^2,$$
and the risk (named "quadratic risk") is then
$$\mathcal{R}_2(\widehat{m}) = E\big[(Y - \widehat{m}(X))^2\big],$$
where $\widehat{m}(x)$ is some prediction. Observe that
$$E[Y] = \underset{m \in \mathbb{R}}{\operatorname{argmin}}\big\{\mathcal{R}_2(m)\big\} = \underset{m \in \mathbb{R}}{\operatorname{argmin}}\, E\big[\ell_2(Y, m)\big].$$
The fact that the expected value minimizes the expected loss, for some loss function (here $\ell_2$), makes it an "elicitable" functional, in the terminology of Gneiting et al. (2007). From this property, we can understand why the expected value is also called "best estimate" (see also the connection to the Bregman distance, in Definition 3.12). As discussed in Huttegger (2013), the use of a quadratic loss function gives rise to a rich geometric structure, for variables that are square integrable, which is essentially very close to the geometry of Euclidean spaces ($L^2$ being a Hilbert space, with an inner product, and a projection operator; we come back to this point in Chap. 10, in "pre-processing" approaches). Up to a monotonic transformation (the square root function), the distance here is the expectation of the quadratic loss function. With the terminology
of Angrist and Pischke (2009), the regression function $\mu$ is the function of $x$ that serves as "the best predictor of $y$, in the mean-squared error sense."
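A quick numerical check of this elicitability property, on an arbitrary (skewed) sample:

```r
## Minimal check that the empirical quadratic risk is minimized at the mean
set.seed(1)
y <- rlnorm(1e4)                       # skewed "losses", made up
risk2 <- function(m) mean((y - m)^2)   # empirical version of R2
c(optimize(risk2, interval = range(y))$minimum, mean(y))  # nearly equal
```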
The quantile loss $\ell_{q,\alpha}$, for some $\alpha \in (0, 1)$, is
$$\ell_{q,\alpha}(y, \widehat{y}) = \max\big\{\alpha (y - \widehat{y}),\, (1 - \alpha)(\widehat{y} - y)\big\} = (y - \widehat{y})\big(\alpha - \mathbf{1}(y < \widehat{y})\big).$$
For example, Kudryavtsev (2009) used a quantile loss function in the context of ratemaking. It is called "quantile" loss as
$$Q(\alpha) = F^{-1}(\alpha) \in \underset{q \in \mathbb{R}}{\operatorname{argmin}}\big\{\mathcal{R}_{q,\alpha}(q)\big\} = \underset{q \in \mathbb{R}}{\operatorname{argmin}}\, E\big[\ell_{q,\alpha}(Y, q)\big].$$
Indeed,
$$\underset{q}{\operatorname{argmin}}\big\{\mathcal{R}_{q,\alpha}(q)\big\} = \underset{q}{\operatorname{argmin}}\left\{(\alpha - 1)\int_{-\infty}^{q} (y - q)\, dF_Y(y) + \alpha \int_{q}^{\infty} (y - q)\, dF_Y(y)\right\},$$
and by computing the derivative of the expected loss via an application of the Leibniz integral rule,
$$0 = (1 - \alpha)\int_{-\infty}^{q^\star} dF_Y(y) - \alpha \int_{q^\star}^{\infty} dF_Y(y),$$
so that $0 = F_Y(q^\star) - \alpha$. Thus, quantiles are also elicitable functionals. When $\alpha = 1/2$ (the median), we recognize the least absolute deviation loss $\ell_1$, $\ell_1(y, \widehat{y}) = |y - \widehat{y}|$.
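The same numerical check works for the quantile loss (here with $\alpha = 0.9$, on an arbitrary sample):

```r
## Minimal check that the quantile loss is minimized at the alpha-quantile
set.seed(1)
y <- rlnorm(1e4); alpha <- .9
loss_q <- function(q) mean((y - q) * (alpha - (y < q)))  # pinball loss
c(optimize(loss_q, interval = range(y))$minimum, quantile(y, alpha))
```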
Definition 3.4 (Empirical Risk $\widehat{\mathcal{R}}_n$) Given a sample $\{(y_i, x_i),\, i = 1, \cdots, n\}$, define the empirical risk
$$\widehat{\mathcal{R}}_n(\widehat{m}) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(\widehat{m}(x_i), y_i\big).$$
Again, in the regression context, with a quadratic loss function, the empirical risk is the mean squared error (MSE), defined as
$$\widehat{\mathcal{R}}_n(\widehat{m}) = \mathrm{MSE}_n = \frac{1}{n} \sum_{i=1}^{n} \big(y_i - \widehat{m}(x_i)\big)^2.$$
Note that $\widehat{m}$, defined as the empirical risk minimizer over a training sample and a collection of models, is also called an M-estimator, following Huber (1964).
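A minimal sketch of such empirical risk minimization with optim, on made-up data with heavy-tailed noise: the quadratic loss recovers the OLS coefficients, whereas the absolute loss gives least absolute deviation (median) regression.

```r
## Minimal sketch of an M-estimator: minimize the empirical risk directly
set.seed(1)
x <- runif(200)
y <- 1 + 2 * x + rt(200, df = 2)  # heavy-tailed noise (made up)
erm <- function(loss)
  optim(c(0, 0), function(b) mean(loss(y - b[1] - b[2] * x)))$par
erm(function(e) e^2)  # quadratic loss: close to coef(lm(y ~ x))
erm(abs)              # absolute loss: least absolute deviations
```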
In the context of a classifier, where $y \in \{0, 1\}$ as well as $\widehat{y}$, a natural loss is the so-called "0/1 loss,"
$$\ell_{0/1}(y, \widehat{y}) = \mathbf{1}(y \neq \widehat{y}) = \begin{cases} 1 & \text{if } \widehat{y} \neq y,\\ 0 & \text{if } \widehat{y} = y. \end{cases}$$
In the context of a classifier, the loss is a function on $\mathcal{Y} \times \mathcal{Y}$, i.e., $\{0, 1\} \times \{0, 1\}$, taking values in $\mathbb{R}_+$. But in many cases, we want to compute a "loss" between $y$ and an estimation of $\mathbb{P}[Y = 1]$, instead of a predicted class $\widehat{y} \in \{0, 1\}$; therefore, it will be a function defined on $\{0, 1\} \times [0, 1]$. That will correspond to a "scoring rule" (see Definition 4.16). The empirical risk associated with the $\ell_{0/1}$ loss is the proportion of misclassified individuals, also named "classifier error rate." But it is possible to get more information: given a sample of size $n$, it is possible to compute the "confusion matrix," which is simply the contingency table of the pairs $(y_i, \widehat{y}_i)$, as in Figs. 3.4 and 3.5.
Given a threshold $t$, one will get the confusion matrix, and various quantities can be computed. To illustrate, consider a simple logistic regression model, on $x$ (and not $s$), and get predictions on $n = 40$ observations from toydata2 (as in Table 8.1); two values are considered for $t$, $30\%$ and $50\%$.
Fig. 3.4 General representation of the "confusion matrix," with counts of $y = 0$ (column on the left, $n_{\cdot 0}$) and $y = 1$ (column on the right, $n_{\cdot 1}$), counts of $\widehat{y} = 0$, "negative" outcomes (row on top, $n_{0\cdot}$) and $\widehat{y} = 1$, "positive" outcomes (row below, $n_{1\cdot}$)
Fig. 3.5 Expressions of the standard metrics associated with the “confusion matrix” (false positive rate, true positive rate, false negative rate, true negative rate, positive predictive value and negative predictive value)
The confusion matrix of Fig. 3.4, with predictions in rows and actual values in columns, is

                        actual value
                    0                   1                   total
  prediction  0     true negative       false negative      n_{0.}
                    TN = n_{00}         FN = n_{01}
              1     false positive      true positive       n_{1.}
                    FP = n_{10}         TP = n_{11}
              total n_{.0}              n_{.1}              n

and the standard metrics of Fig. 3.5 are
$$\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}}, \quad \mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \quad \mathrm{TNR} = \frac{\mathrm{TN}}{\mathrm{FP} + \mathrm{TN}}, \quad \mathrm{FNR} = \frac{\mathrm{FN}}{\mathrm{TP} + \mathrm{FN}},$$
$$\mathrm{PPV} = \frac{\mathrm{TP}}{\mathrm{FP} + \mathrm{TP}} \quad\text{and}\quad \mathrm{NPV} = \frac{\mathrm{TN}}{\mathrm{FN} + \mathrm{TN}}.$$
Threshold t = 30%:

                        actual 0    actual 1    total
  prediction  0         TN = 12     FN = 3        15
              1         FP = 8      TP = 17       25
              total       20          20          40

Threshold t = 50%:

                        actual 0    actual 1    total
  prediction  0         TN = 18     FN = 5        23
              1         FP = 2      TP = 15       17
              total       20          20          40
Fig. 3.6 Confusion matrices with thresholds $30\%$ and $50\%$, for $n = 40$ observations from the toydata2 dataset, and a logistic regression for $\widehat{m}$
From Fig. 3.6, we can compute various quantities (as explained in Figs. 3.4 and 3.5). Sensitivity (true positive rate) is the probability of a positive test result, conditioned on the individual truly being positive. Thus, here we have
$$\mathrm{TPR}(30\%) = \frac{17}{17 + 3} = 0.85 \quad\text{and}\quad \mathrm{TPR}(50\%) = \frac{15}{15 + 5} = 0.75,$$
whereas the miss rate (false negative rate)
$$\mathrm{FNR}(30\%) = \frac{3}{3 + 17} = 0.15 \quad\text{and}\quad \mathrm{FNR}(50\%) = \frac{5}{5 + 15} = 0.25.$$
Specificity (true negative rate) is the probability of a negative test result, conditioned on the individual truly being negative,
$$\mathrm{TNR}(30\%) = \frac{12}{8 + 12} = 0.6 \quad\text{and}\quad \mathrm{TNR}(50\%) = \frac{18}{2 + 18} = 0.9,$$
whereas the fall-out (false positive rate) is
$$\mathrm{FPR}(30\%) = \frac{8}{8 + 12} = 0.4 \quad\text{and}\quad \mathrm{FPR}(50\%) = \frac{2}{2 + 18} = 0.1.$$
The negative predictive value (NPV)
is
$$\mathrm{NPV}(30\%) = \frac{12}{12 + 3} = 0.8 \quad\text{and}\quad \mathrm{NPV}(50\%) = \frac{18}{18 + 5} = 0.7826,$$
whereas the precision (positive predictive value) is
$$\mathrm{PPV}(30\%) = \frac{17}{17 + 8} = 0.68 \quad\text{and}\quad \mathrm{PPV}(50\%) = \frac{15}{15 + 2} = 0.8824.$$
Accuracy is the proportion of correct predictions,
$$\mathrm{ACC}(30\%) = \frac{12 + 17}{12 + 8 + 3 + 17} = 0.725 \quad\text{and}\quad \mathrm{ACC}(50\%) = \frac{18 + 15}{18 + 2 + 5 + 15} = 0.825,$$
whereas “balanced accuracy” (see Langford and Schapire 2005) is the average of the true positive rate (TPR) and the true negative rate (TNR),
$$\mathrm{BACC}(30\%) = \frac{0.85 + 0.6}{2} = 0.725 \quad\text{and}\quad \mathrm{BACC}(50\%) = \frac{0.75 + 0.9}{2} = 0.825.$$
Finally, Cohen's kappa (from Cohen (1960)) compares the observed accuracy with the one expected if $y$ and $\widehat{y}$ were independent (as in the Chi-squared test),
$$\kappa(30\%) = \frac{\dfrac{29}{40} - \dfrac{20}{40}}{1 - \dfrac{20}{40}} = 0.45 \quad\text{and}\quad \kappa(50\%) = \frac{\dfrac{33}{40} - \dfrac{20}{40}}{1 - \dfrac{20}{40}} = 0.65,$$
whereas Matthews correlation coefficient (see Definition 8.15) is
$$\mathrm{MCC}(30\%) = 0.464758 \quad\text{and}\quad \mathrm{MCC}(50\%) = 0.6574383.$$
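All the quantities above can be recomputed in a few lines; a minimal sketch on simulated data (the toydata2 dataset itself is not reproduced here, so the data-generating step is made up):

```r
## Minimal sketch: confusion matrix and metrics for a logistic regression
set.seed(1)
df <- data.frame(x = rnorm(200))
df$y <- rbinom(200, size = 1, prob = plogis(2 * df$x))
m <- glm(y ~ x, data = df, family = binomial)
t <- .3                                         # cutoff threshold
y_hat <- as.numeric(predict(m, type = "response") > t)
tab <- table(prediction = y_hat, actual = df$y)
TN <- tab["0","0"]; FN <- tab["0","1"]; FP <- tab["1","0"]; TP <- tab["1","1"]
c(TPR = TP/(TP+FN), FPR = FP/(FP+TN), PPV = TP/(FP+TP),
  NPV = TN/(FN+TN), ACC = (TP+TN)/sum(tab))
```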
One issue here is that the sample used to compute the empirical risk is the same as the one used to fit the model; this is also called the "in-sample risk,"
$$\widehat{\mathcal{R}}^{\mathrm{ins}}(\widehat{m}) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(\widehat{m}(x_i), y_i\big).$$
Thus, if we consider
$$\widehat{m}_n = \underset{m \in \mathcal{M}}{\operatorname{argmin}}\big\{\widehat{\mathcal{R}}^{\mathrm{ins}}(m)\big\},$$
on a set $\mathcal{M}$ of admissible models, we will have a tendency to capture a lot of noise, and to over-adjust the data: this is called "over-fitting." For example, in Fig. 3.7, we have two fitted models $\widehat{m}$ such that the in-sample risk is null,
$$\widehat{\mathcal{R}}^{\mathrm{ins}}(\widehat{m}) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(\widehat{m}(x_i), y_i\big) = 0.$$
Fig. 3.7 Two fitted models from a (fake) dataset $(x_1, y_1), \cdots, (x_{10}, y_{10})$, with a linear model on the left, and a polynomial model on the right, such that for both the in-sample risk is null, $\widehat{\mathcal{R}}^{\mathrm{ins}}(\widehat{m}) = 0$
To avoid this problem, we randomly divide the initial database into a training dataset and a validation dataset. The training dataset, with $n_T < n$ observations, will be used to estimate the parameters of the model,
$$\widehat{m}_{n_T} = \underset{m \in \mathcal{M}}{\operatorname{argmin}}\big\{\widehat{\mathcal{R}}^{\mathrm{ins}}_{n_T}(m)\big\}.$$
Then, the validation dataset, with $n_V = n - n_T$ observations, will be used to select the model, using the "out-of-sample risk"
$$\widehat{\mathcal{R}}^{\mathrm{oos}}_{n_V}(\widehat{m}_{n_T}) = \frac{1}{n_V} \sum_{i=1}^{n_V} \ell\big(\widehat{m}_{n_T}(x_i), y_i\big).$$
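A minimal sketch of that training/validation logic, on simulated data (the model and data-generating process are arbitrary): the in-sample risk of a flexible model is optimistic, and it is the out-of-sample risk that should be used to select models.

```r
## Minimal sketch: in-sample versus out-of-sample quadratic risk
set.seed(1)
n <- 200
d <- data.frame(x = runif(n))
d$y <- sin(2 * pi * d$x) + rnorm(n, sd = .3)
idx_T <- sample(n, 150)                          # training indices
m_hat <- lm(y ~ poly(x, 10), data = d[idx_T, ])  # a (too) flexible model
mse <- function(idx) mean((d$y[idx] - predict(m_hat, newdata = d[idx, ]))^2)
mse(idx_T)                # in-sample risk (optimistic)
mse(setdiff(1:n, idx_T))  # out-of-sample risk, on the validation dataset
```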
Quite generally, given a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$, and a collection $\mathcal{D}_n$ of $n$ independent observations drawn from $(X, Y)$ (corresponding to the dataset), the risk is
$$E_{X}\Big[E_{\mathcal{D}_n}\big[E_{Y \mid X}\big[\ell\big(Y, \widehat{m}(X) \mid \mathcal{D}_n\big)\big]\big]\Big],$$
that cannot be calculated without knowing the true distribution of $(Y, X)$. If $\ell$ is the quadratic loss, $\ell_2(y, \widehat{y}) = (y - \widehat{y})^2$,
$$\mathcal{R}_2(\widehat{m}) = E_{\mathcal{D}_n}\big[E_{Y \mid X}\big[\ell\big(Y, \widehat{m}(X) \mid \mathcal{D}_n\big)\big]\big] = \big(E_{Y \mid X}[Y] - E_{\mathcal{D}_n}\big[\widehat{m}(X) \mid \mathcal{D}_n\big]\big)^2 + E_{Y \mid X}\big[\big(Y - E_{Y \mid X}[Y]\big)^2\big] + E_{\mathcal{D}_n}\big[\big(\widehat{m}(X) \mid \mathcal{D}_n - E_{\mathcal{D}_n}[\widehat{m}(X) \mid \mathcal{D}_n]\big)^2\big],$$
that is, the squared bias of the model, plus the irreducible variance of $Y$, plus the variance of the estimator across training samples $\mathcal{D}_n$.
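A minimal simulation (with an arbitrary sine regression function) makes the three terms visible at a single point $x_0$, by refitting the model on many training samples $\mathcal{D}_n$:

```r
## Minimal sketch: squared bias, noise, and variance at x0 = 0.5
set.seed(1)
mu <- function(x) sin(2 * pi * x)   # true (unknown) regression function
x0 <- data.frame(x = .5)
fits <- replicate(500, {            # predictions over 500 training samples
  d <- data.frame(x = runif(50)); d$y <- mu(d$x) + rnorm(50, sd = .3)
  predict(lm(y ~ poly(x, 3), data = d), newdata = x0)
})
c(bias2 = (mean(fits) - mu(.5))^2, variance = var(fits), noise = .3^2)
```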