Hang Lee

Foundations of Applied Statistical Methods

Second Edition

Hang Lee, Massachusetts General Hospital Biostatistics Center, Department of Medicine, Harvard Medical School, Boston, MA, USA
ISBN 978-3-031-42295-9
ISBN 978-3-031-42296-6 (eBook)
https://doi.org/10.1007/978-3-031-42296-6
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2014, 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Paper in this product is recyclable.
Preface
Researchers who design and conduct experiments or sample surveys, perform data analysis and statistical inference, and write scientific reports need adequate knowledge of applied statistics. To build adequate and sturdy knowledge of applied statistical methods, a firm foundation is essential. I have come across many researchers who had studied statistics in the past but are still far from ready to apply the learned knowledge to their problem solving, and others who have forgotten what they had learned. This could be partly because the mathematical technicality dealt with in their past study materials was above their mathematics proficiency, or because the worked examples they studied often failed to address essential fundamentals of the applied methods. This book is written to fill the gap between traditional textbooks, which involve an ample amount of technically challenging mathematical derivations, and worked examples of data analyses that often underemphasize fundamentals. The chapters of this book are dedicated to spelling out and demonstrating, not merely explaining, necessary foundational ideas so that motivated readers can learn to fully appreciate the fundamentals of the commonly applied methods and revivify their forgotten knowledge of the methods without having to deal with complex mathematical derivations or attempt to generalize oversimplified worked examples of plug-and-play techniques. Detailed mathematical expressions are exhibited only if they are definitional or intuitively comprehensible. Data-oriented examples are illustrated only to aid the demonstration of fundamentals. This book can be used as a guidebook for applied researchers or as an introductory statistical methods course textbook for graduate students not majoring in statistics.
Boston, MA, USA
Hang Lee
Contents
1 Description of Data and Essential Probability Models
  1.1 Types of Data
  1.2 Description of Data
    1.2.1 Distribution
    1.2.2 Description of Categorical Data
    1.2.3 Description of Continuous Data
    1.2.4 Stem-and-Leaf Plot
    1.2.5 Box-and-Whisker Plot
  1.3 Descriptive Statistics
    1.3.1 Statistic
    1.3.2 Central Tendency Descriptive Statistics for Quantitative Outcomes
    1.3.3 Dispersion Descriptive Statistics for Quantitative Outcomes
    1.3.4 Variance
    1.3.5 Standard Deviation
    1.3.6 Property of Standard Deviation After Data Transformations
    1.3.7 Other Descriptive Statistics for Dispersion
    1.3.8 Dispersions Among Multiple Data Sets
    1.3.9 Caution to CV Interpretation
  1.4 Statistics for Describing Relationships Between Two Outcomes
    1.4.1 Linear Correlation Between Two Continuous Outcomes
    1.4.2 Contingency Table to Describe an Association Between Two Categorical Outcomes
    1.4.3 Odds Ratio
  1.5 Two Essential Probability Distributions
    1.5.1 Gaussian Distribution
    1.5.2 Probability Density Function of Gaussian Distribution
    1.5.3 Application of Gaussian Distribution
    1.5.4 Standard Normal Distribution
    1.5.5 Binomial Distribution
  Bibliography

2 Statistical Inference Concentrating on a Single Mean
  2.1 Population and Sample
    2.1.1 Sampling and Non-sampling Errors
    2.1.2 Sample Distribution and Sampling Distribution
    2.1.3 Standard Error
    2.1.4 Sampling Methods and Sampling Variability of the Sample Means
  2.2 Statistical Inference
    2.2.1 Data Reduction and Related Nomenclatures
    2.2.2 Central Limit Theorem
    2.2.3 The t-Distribution
    2.2.4 Hypothesis Testing
    2.2.5 Accuracy and Precision
    2.2.6 Interval Estimation and Confidence Interval
    2.2.7 Bayesian Inference
    2.2.8 Study Design and Its Impact to Accuracy and Precision
  Bibliography

3 t-Tests for Two-Mean Comparison
  3.1 Independent Samples t-Test for Comparing Two Independent Means
    3.1.1 Independent Samples t-Test When Variances Are Unequal
    3.1.2 Denominator Formulae of the Test Statistic for Independent Samples t-Test
    3.1.3 Connection to the Confidence Interval
  3.2 Paired Sample t-Test for Comparing Paired Means
  3.3 Use of Excel for t-Tests
  Bibliography

4 Inference Using Analysis of Variance (ANOVA) for Comparing Multiple Means
  4.1 Sums of Squares and Variances
  4.2 F-Test
  4.3 Multiple Comparisons and Increased Chance of Type 1 Error
  4.4 Beyond Single-Factor ANOVA
    4.4.1 Multi-factor ANOVA
    4.4.2 Interaction
    4.4.3 Repeated Measures ANOVA
    4.4.4 Use of Excel for ANOVA
  Bibliography

5 Inference of Correlation and Regression
  5.1 Inference of Pearson's Correlation Coefficient
  5.2 Linear Regression Model with One Independent Variable: Simple Regression Model
  5.3 Simple Linear Regression Analysis
  5.4 Linear Regression Models with Multiple Independent Variables
  5.5 Logistic Regression Model with One Independent Variable: Simple Logistic Regression Model
  5.6 Consolidation of Regression Models
    5.6.1 General and Generalized Linear Models
    5.6.2 Multivariate Analysis Versus Multivariable Model
  5.7 Application of Linear Models with Multiple Independent Variables
  5.8 Worked Examples of General and Generalized Linear Models
    5.8.1 Worked Example of a General Linear Model
    5.8.2 Worked Example of a Generalized Linear Model (Logistic Model) Where All Multiple Independent Variables Are Dummy Variables
  5.9 Measure of Agreement Between Outcome Pairs: Concordance Correlation Coefficient for Continuous Outcomes and Kappa (κ) for Categorical Outcomes
  5.10 Handling of Clustered Observations
  Bibliography

6 Normal Distribution Assumption-Free Nonparametric Inference
  6.1 Comparing Two Proportions Using a 2 × 2 Contingency Table
    6.1.1 Chi-Square Test for Comparing Two Independent Proportions
    6.1.2 Fisher's Exact Test
    6.1.3 Comparing Two Proportions in Paired Samples
  6.2 Normal Distribution Assumption-Free Rank-Based Methods for Comparing Distributions of Continuous Outcomes
    6.2.1 Permutation Test
    6.2.2 Wilcoxon's Rank Sum Test
    6.2.3 Kruskal–Wallis Test
    6.2.4 Wilcoxon's Signed Rank Test
  6.3 Linear Correlation Based on Ranks
  6.4 About Nonparametric Methods
  Bibliography

7 Methods for Censored Survival Time Data
  7.1 Censored Observations
  7.2 Probability of Surviving Longer Than Certain Duration
  7.3 Statistical Comparison of Two Survival Distributions with Censoring
  Bibliography

8 Sample Size and Power
  8.1 Sample Size for Single Mean Interval Estimation
  8.2 Sample Size for Hypothesis Tests
    8.2.1 Sample Size for Comparing Two Means Using Independent Samples z- and t-Tests
    8.2.2 Sample Size for Comparing Two Proportions
  Bibliography

9 Review Exercise Problems
  9.1 Review Exercise 1
    9.1.1 Solutions for Review Exercise 1
  9.2 Review Exercise 2
    9.2.1 Solutions for Review Exercise 2

10 Statistical Tables

Index
Chapter 1

Description of Data and Essential Probability Models
This chapter portrays how to make sense of gathered data before performing formal statistical inference. The topics covered are types of data, how to visualize data, how to summarize data into a few descriptive statistics (i.e., condensed numerical indices), and some useful probability models.
1.1 Types of Data
Typical types of data arising from most studies fall into one of the following categories.
Nominal categorical data contain qualitative information and appear as discrete values that are codified into numbers or characters (e.g., 1 = case with a disease diagnosis, 0 = control; M = male, F = female, etc.).
Ordinal categorical data are semi-quantitative and discrete, and the numeric coding scheme orders the values, such as 1 = mild, 2 = moderate, and 3 = severe. Note that a value of 3 (severe) is not necessarily three times more severe than 1 (mild).
Count (number of events) data are quantitative and discrete (i.e., 0, 1, 2, . . .).

Interval scale data are quantitative and continuous. There is no absolute 0, and the reference value is arbitrary. Examples of such data are temperature values in °C and °F.

Ratio scale data are quantitative and continuous, and there is an absolute 0, e.g., body weight and height.

In most cases, data fall into the classification scheme shown in Table 1.1, in which data can be classified as either quantitative or qualitative, and as discrete or continuous. Nonetheless, some data types are not clear-cut; in particular, the similarity and dissimilarity between the ratio scale and the interval scale may need further clarification.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
H. Lee, Foundations of Applied Statistical Methods, https://doi.org/10.1007/978-3-031-42296-6_1
Table 1.1 Classification of data types

               Discrete                                              Continuous
Qualitative    Nominal categorical                                   N/A
               (e.g., M = male, F = female)
Quantitative   Ordinal categorical                                   Interval scale (e.g., temperature)
               (e.g., 1 = mild, 2 = moderate, 3 = severe)            Ratio scale (e.g., weight)
               Count (e.g., number of incidences 0, 1, 2, 3, . . .)
Ratio scale: If two distinct values of quantitative data can be represented as a ratio of two numerical values, then such data are ratio scale data. For example, for two observations xi = 200 and xj = 100, with i ≠ j, the ratio xi/xj = 2 shows that xi is twice xj; examples include lung volume, age, disease duration, etc.
Interval scale: If two distinct values of quantitative data cannot be meaningfully expressed as a ratio, then such data are interval scale data. Temperature is a good example, as there are three temperature scales, i.e., Fahrenheit, Celsius, and Kelvin. The Kelvin scale has an absolute 0 (there is no negative temperature in the Kelvin scale). For example, 200 °F is not a temperature twice as high as 100 °F. We can only say that 200 °F is higher by 100 degrees (i.e., the displacement between 200 and 100 is 100 degrees on the Fahrenheit scale).
1.2 Description of Data
1.2.1 Distribution
A distribution is a complete description of how much occurring chance (i.e., probability) is assigned to each unique data value or range of data values. The following two examples will help you grasp the concept. If you keep rolling a die, you will expect to observe 1, 2, 3, 4, 5, or 6 equally likely, i.e., the probability of each unique outcome value is 1/6. We say a probability of 1/6 is distributed to 1, 1/6 to 2, 1/6 to 3, 1/6 to 4, 1/6 to 5, and 1/6 to 6. Another example: if you keep rolling a die many times, and each time you call it a success if the observed outcome is 5 or 6 and a failure otherwise, then your expected chance of observing a success will be 1/3 and that of a failure will be 2/3. We say a probability of 1/3 is distributed to success, and 2/3 is distributed to failure. Many distributions cannot be described as simply as these two examples and require descriptions using sophisticated mathematical functions.
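The die-rolling description above can be checked empirically. The sketch below (assuming Python, with hypothetical variable names) estimates the distributed probabilities by relative frequency over many simulated rolls:

```python
import random

random.seed(0)
n_rolls = 60_000
rolls = [random.randint(1, 6) for _ in range(n_rolls)]

# Each face should receive a relative frequency close to 1/6
freq = {face: rolls.count(face) / n_rolls for face in range(1, 7)}

# Call a roll of 5 or 6 a "success": its probability should be near 1/3
p_success = sum(1 for r in rolls if r >= 5) / n_rolls
p_failure = 1 - p_success
```

The relative frequencies converge toward the distributed probabilities (1/6 per face, 1/3 for success) as the number of rolls grows.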
Let us discuss how to describe the distributions arising from various types of data. One way to describe a set of collected data is to describe the distribution of relative frequency for the observed individual values (e.g., which values are very common and which are less common). Graphs, simple tables, or a few summary numbers are commonly used.
1.2.2 Description of Categorical Data
A simple tabulation (frequency table) lists the observed count (and proportion as a percentage) for each category. A bar chart (see Figs. 1.1 and 1.2) can be used as a visual summary of nominal and ordinal outcome distributions. The size of each bar in Figs. 1.1 and 1.2 reflects the actual counts. It is also common to present the relative frequency (i.e., the proportion of each category as a percentage of the total).
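A frequency table like those in Figs. 1.1 and 1.2 takes only a few lines of code. A minimal sketch (assuming Python; the severity ratings here are hypothetical, not the book's data):

```python
from collections import Counter

# Hypothetical ordinal outcomes: 1 = mild, 2 = moderate, 3 = severe
severity = [1, 2, 2, 3, 1, 2, 1, 1, 3, 2, 2, 1]

counts = Counter(severity)
n = len(severity)

# Count and relative frequency (percentage of the total) per category
for category in sorted(counts):
    pct = 100 * counts[category] / n
    print(f"{category}: count = {counts[category]}, {pct:.1f}%")
```

The printed counts are the bar heights of Fig. 1.2's style of chart; the percentages are the relative-frequency version.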
1.2.3 Description of Continuous Data
Figure 1.3 is a list of white blood cell (WBC) counts of 31 patients diagnosed with a certain illness listed by the patient identification number. Does this listing itself tell us the group characteristics such as the average and the variability among patients?
Fig. 1.1 Frequency table and bar chart for describing nominal categorical data

Fig. 1.2 Frequency table and bar chart for describing ordinal data

How can we describe the distribution of these data, i.e., how much of the occurring chance is distributed to WBC = 5200, how much to WBC = 3100, . . ., etc.? Such a description may be very cumbersome. As depicted in Fig. 1.4, listing the full data in ascending order is a primitive presentation, but it still does not describe the distribution. An option is to visualize the relative frequencies for grouped intervals of the observed data. Such a presentation is called a histogram. To create a histogram, one first creates equally spaced WBC categories and counts how many observations fall into each category. Then a bar graph can be drawn in which each bar's size indicates the relative frequency of that specific WBC interval. Drawing such bar graphs manually is cumbersome, and the next section introduces a much less cumbersome manual technique to visualize continuous outcomes.
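The interval-and-count procedure just described can be sketched in code. A minimal version (assuming Python; the WBC values below are hypothetical stand-ins, not the book's 31 observations):

```python
# Hypothetical WBC counts, binned into equally spaced 2000-wide intervals
wbc = [5200, 3100, 1800, 4300, 6500, 3900, 11200, 4700, 3400, 5600]

width = 2000
lo = (min(wbc) // width) * width          # lower edge of the first interval
n_bins = (max(wbc) - lo) // width + 1

# Count how many observations fall into each interval
counts = [0] * n_bins
for x in wbc:
    counts[(x - lo) // width] += 1

# Relative frequencies are the bar heights of the histogram
rel_freq = [c / len(wbc) for c in counts]
```

Each entry of `rel_freq` is the proportion of the sample falling into one WBC interval; plotted as bars, these form the histogram.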
1.2.4 Stem-and-Leaf Plot
The stem-and-leaf plot requires much less work than creating the conventional histogram while providing the same information as what the histogram does. This is a quick and easy option to sketch a continuous data distribution.
Fig. 1.3 List of WBC raw data of 31 subjects
Let us use a small data set for illustration and then revisit our WBC data example for more discussion after this method becomes familiar to you. The following nine data points, 12, 32, 22, 28, 26, 45, 32, 21, and 85, are ages (ratio scale) of a small group. Figures 1.5, 1.6, 1.7, 1.8 and 1.9 demonstrate how to create the stem-and-leaf plot of these data.
The main idea of this technique is a quick sketch of the distribution of an observed data set without computational burden. Let us just take each datum in the order that it is recorded (i.e., the data are not preprocessed by other techniques such as sorting by ascending/descending order) and plot one value at a time (see Fig. 1.5). Note that the oldest observed age is 85 years, which is much greater than the next oldest age 45 years, and the unobserved stem interval values (i.e., 50s, 60s, and 70s) are placed. The determination of the number of equally spaced major intervals (i.e., number of stems) can be subjective and data range dependent.
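The plotting steps just described can also be sketched in code. A minimal version (assuming Python) that stems on the tens digit, keeps the unobserved stems (50s, 60s, 70s) visible, and sorts the leaves:

```python
from collections import defaultdict

ages = [12, 32, 22, 28, 26, 45, 32, 21, 85]

# Stem = tens digit, leaf = ones digit
leaves = defaultdict(list)
for a in ages:
    leaves[a // 10].append(a % 10)

# Include empty stems so gaps in the data remain visible in the sketch
plot = []
for stem in range(min(leaves), max(leaves) + 1):
    row = "".join(str(d) for d in sorted(leaves[stem]))
    plot.append(f"{stem} | {row}")

print("\n".join(plot))
```

The printed rows mirror the manual plot: one row per stem, with the observed leaves strung out beside it and blank rows for the unobserved 50s–70s.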
Figure 1.10 depicts the distribution of our WBC data set by the stem-and-leaf plot. Most values lie between 3000 and 4000 (i.e., the mode); the contour of the frequency distribution is skewed to the right, and the mean value does not describe the central location well; the smallest and the largest observations are 1800 and 11,200, respectively, and there are no observed values lying between 10,000 and 11,000.
Fig. 1.4 List of 31 individual WBC values in ascending order
1.2.5 Box-and-Whisker Plot
Unlike the stem-and-leaf plot, this plot does not show the individual data values explicitly. It can describe data sets whose sample sizes are larger than what can usually be illustrated manually by the stem-and-leaf plot. If the stem-and-leaf plot is seen from a bird's-eye point of view (Fig. 1.11), the resulting description can be made as depicted in the right-hand panels of Figs. 1.12 and 1.13.
The unique feature of this technique is to identify and visualize where the middle half of the data lies (i.e., the interquartile range), shown by the box, and the interval where the rest of the data lie, shown by the whiskers.
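The box and whiskers are determined by a five-number summary. A sketch of the computation (assuming Python; note that quartile conventions differ across software, and `statistics.quantiles`' default "exclusive" method is just one of them):

```python
import statistics

# The nine ages from the stem-and-leaf example, here pre-sorted
ages = [12, 21, 22, 26, 28, 32, 32, 45, 85]

q1, q2, q3 = statistics.quantiles(ages, n=4)  # quartiles (q2 is the median)

box = (q1, q3)                     # the box spans the middle half of the data
whiskers = (min(ages), max(ages))  # the whiskers cover the rest
iqr = q3 - q1                      # interquartile range = box width
```

With these numbers in hand, drawing the plot is mechanical: a box from `q1` to `q3` with a line at `q2`, and whiskers out to the extremes.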
If there are two or more modes, the box-and-whisker plot cannot fully characterize such a phenomenon, but the stem-and-leaf can (see Fig. 1.14).
Fig. 1.5 Step-by-step illustration of creating a stem-and-leaf plot

Fig. 1.6 Illustration of creating a stem-and-leaf plot
Fig. 1.7 Two stem-and-leaf plots describing the same two data sets
Fig. 1.8 Common mistakes in stem-and-leaf plot

Fig. 1.9 Two stem-and-leaf plots describing the same data sets by ascending and descending orders
1.3 Descriptive Statistics
In addition to a visual description such as the stem-and-leaf plot, further description of the distribution by means of a few statistical metrics is useful. Such metrics are called descriptive statistics, and they indicate where the most frequently observed values are concentrated (i.e., representative location) and how much the occurring chance distribution is scattered around that concentrated location (i.e., dispersion).
1.3.1 Statistic
Fig. 1.10 Presentation of WBC data of 31 subjects by a stem-and-leaf plot

Fig. 1.11 Viewpoint of the stem-and-leaf plot from above

A statistic is a function of data, wherein a function usually appears as a mathematical expression that takes the observed data and reduces (processes) them to a single summary metric, e.g., mean = sum over all data divided by the sample size. Note that the term mathematical expression is interchangeable with formula. As the term "formula" usually refers to a plug-and-play setting, this book names it a mathematical expression, and the least amount of expression is introduced only when necessary.
Fig. 1.12 Stem-and-leaf versus box-and-whisker plots
1 Description of Data and Essential Probability Models
Fig. 1.13 Histogram versus box-and-whisker plot
1.3.2 Central Tendency Descriptive Statistics for Quantitative Outcomes
In practice, two kinds of descriptive statistics are used for quantitative outcomes: metric indices characterizing the central tendency and those characterizing dispersion. The mean (i.e., the sum of all observations divided by the sample size), the median (i.e., the midpoint value), and the mode (i.e., the most frequent value) are the central tendency descriptive statistics.
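These three statistics are one-liners in most software. A sketch (assuming Python, reusing the nine ages from the stem-and-leaf example):

```python
import statistics

ages = [12, 32, 22, 28, 26, 45, 32, 21, 85]

mean = statistics.fmean(ages)     # sum of all observations / sample size
median = statistics.median(ages)  # midpoint value of the sorted data
mode = statistics.mode(ages)      # most frequent value

# The single large value (85) pulls the mean above the median,
# a first hint that the mean can misrepresent a skewed distribution.
```

For these data the mean exceeds the median, which matches the right-skew visible in the stem-and-leaf plot of the same values.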
Fig. 1.14 Stem-and-leaf and box-and-whisker plots of a skewed data set
1.3.3 Dispersion Descriptive Statistics for Quantitative Outcomes
The range (i.e., maximum value minus minimum value) and the interquartile range (i.e., 75th percentile minus 25th percentile) are very simple statistics by which the dispersion of a data set is gauged. Other commonly used dispersion descriptive statistics are the variance, standard deviation, and coefficient of variation. These statistics describe the dispersion of data, particularly when the data are symmetrically scattered around the mean. The variance and standard deviation are important statistics that play a pivotal role in formal statistical inference, which will be discussed in Chap. 2.
1.3.4 Variance
The variance of a distribution, denoted by σ2, can be conceptualized as an averaged squared deviation (explained in detail below) of the data values from their mean. The more dispersed the data are, the greater the variance. Standard textbooks present both definitional and computational mathematical expressions. Until the modern computer became widely used, finding shortcuts for manual calculation and choosing the right tool for a quick and easy calculation were major issues of statistical education and practice. Today's data analysis utilizes computer software, and manual calculation skills are not highly important. Nonetheless, understanding the genesis of the definitional expression, at least, is important. The following is the definitional expression of the variance:

Fig. 1.15 Definitional formula of variance

    σ2 = Σi=1..n (xi - x̄)2 / (n - 1)

where the xi's, for i = 1, 2, . . ., n (i.e., n is the sample size), are the individual data values and x̄ is their mean. The notation Σi=1..n in the numerator is to sum over all the individual terms (xi - x̄)2, for i = 1 to n (e.g., n = 31 for the WBC data). The term (xi - x̄)2 is the squared deviation of the ith individual data value from the mean and is depicted by d2 in the following visual demonstration.

After this summation is carried out, the resulting numerator is divided by the divisor n - 1 (note that the divisor will be 30 for the WBC data example).
As depicted in Fig. 1.15, deviations (i.e., xi - x̄) are presented by horizontal solid line segments, for (xi - x̄) > 0, or dashed line segments, for (xi - x̄) < 0. The length of each line segment represents how far each datum is displaced above or below the mean. How do we cumulate and preserve the deviations of the entire group? If straight summation is considered, the positive and negative individual deviations may cancel each other out, and the resulting sum may not retain the information; thus, straight summation is not a great idea. The individual deviations are squared first and then summed up so that the resulting sum retains the information (i.e., both positive and negative deviations), although the retained quantity is not in the original scale. Then, the sum of squared deviations is divided by n - 1. If it had been divided by n, it would have been the average squared deviation. Instead, the divisor used is n - 1. Normally, an average is obtained by dividing the sum of all values by the sample size n. However, when computing the variance using sample data, we divide by n - 1, not by n. The idea behind this is the following. If the numerator (i.e., the sum of squared deviations from the mean) is divided by the sample size n, such a calculation will slightly downsize the true variance. The reason is that when the deviation of each individual data point from the mean is obtained, the mean is usually not externally given to us but is generated within the given data set (i.e., referencing an internally obtained mean value), so the observed deviations can be slightly smaller than what they should be. So, to make an appropriate adjustment in the final averaging step, we divide by n - 1. You may be curious why it needs to be 1 less than the sample size, not 2 less, or some other value. Unlike other choices (e.g., n - 2, n - 3, etc.), n - 1 can handle any sample size that is 2 or greater (note that the smallest sample size that can have a variance is 2). There is a formal proof that the divisor n - 1 is the best for any sample size, but it is not necessary to cover it in full detail at this introductory level.
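The downward bias of the divisor n can be seen empirically. A simulation sketch (assuming Python, drawing repeated small samples from an arbitrary Gaussian population whose true variance is 100):

```python
import random

random.seed(1)
n, reps = 5, 20_000  # small samples, many repetitions

avg_n1 = avg_n = 0.0
for _ in range(reps):
    sample = [random.gauss(0.0, 10.0) for _ in range(n)]  # population variance 100
    m = sum(sample) / n
    ss = sum((x - m) ** 2 for x in sample)  # sum of squared deviations
    avg_n1 += ss / (n - 1) / reps           # divisor n - 1
    avg_n += ss / n / reps                  # divisor n

# avg_n1 lands near 100, while avg_n lands near 80 (= 100 * (n - 1) / n),
# illustrating why dividing by n systematically undershoots.
```

Averaged over many samples, the n - 1 version recovers the population variance while the n version falls short by the factor (n - 1)/n, exactly the internal-mean effect described above.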
The calculated variance of the WBC data set is [(5200 - 4900)2 + (3100 - 4900)2 + . . . + (6500 - 4900)2]/(31 - 1) = 4,778,596. Note that the variance's unit is not the same as the raw data's unit (because the deviations are squared).
1.3.5 Standard Deviation
The standard deviation of a distribution, denoted by σ, is the square root of the variance (i.e., σ = √variance), and the scale of the standard deviation is the same as that of the raw data. The more dispersed the data are, the greater the standard deviation. If the dispersed data form a particular shape (e.g., a bell curve), then one standard-deviation unit symmetrically around the mean (i.e., above and below the mean) will cover about the middle two-thirds of the data range (see the standard normal distribution in Sect. 1.5.4).
1 Description of Data and Essential Probability Models
1.3.6 Property of Standard Deviation After Data Transformations
The observed data often require transformations for the purpose of analysis. One example is to shift the whole data set to a new reference point by simply adding a positive constant to, or subtracting it from, the raw data values. Such a simple transformation does not alter the distances between the individual data values; thus, the standard deviation remains unchanged (Fig. 1.16).
Another example is to change the scale of the data, with or without changing the reference point. In general, if a data vector x = (x1, x2, . . ., xn) with mean μx and standard deviation σx is transformed to y = a·x + b, where a is the scaling constant and b is the reference point, then the mean of y becomes μy = a·(mean of x) + b = a·μx + b, and the standard deviation of y is σy = a·σx (for a positive scaling constant a). Note that adding a constant does not alter the original standard deviation; only the scaling factor does.
The following example (see Fig. 1.17) demonstrates how the means and standard deviations change after transformation. The first column lists the body temperatures of eight individuals recorded in °C, the second column lists their deviations from the normal body temperature of 36.5 °C (i.e., d = C − 36.5), and the third column lists their values in °F (i.e., F = 1.8C + 32). The mean of the deviations from the normal temperature is 0.33 (i.e., 0.33 degrees higher than the normal temperature on average), which can be reproduced simply as the difference between the two mean values 36.83 and 36.5, without having to recalculate from the transformed individual data. The standard deviation remains the
Fig. 1.16 Shifted data without changing dispersion
1.3 Descriptive Statistics
Fig. 1.17 Scale invariant and scale variant transformations
same because this transformation is just a shift of the distribution to a new reference point. The mean of the values transformed to the °F scale is 98.29, which can be obtained simply by multiplying the mean of 36.83 by 1.8 and then adding 32, without having to recalculate from the transformed individual observations. This transformation involves not only shifting the distribution but also rescaling, where the rescaling multiplies the original observations by 1.8 before shifting the entire distribution to the reference point of 32. The standard deviation of the data transformed to the °F scale is 1.12, which can be obtained directly by multiplying the standard deviation of the raw data in °C by 1.8, i.e., 1.12 = 0.62 × 1.8.
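These two rules (a shift leaves the standard deviation unchanged; rescaling multiplies it by the scaling constant) can be verified numerically. In the following sketch, the eight °C readings are hypothetical; only the transformation F = 1.8C + 32 comes from the text.

```python
import statistics

celsius = [36.2, 36.5, 36.8, 37.0, 36.4, 37.3, 36.9, 37.5]  # hypothetical readings
fahrenheit = [1.8 * c + 32 for c in celsius]                 # F = 1.8C + 32

mean_c, sd_c = statistics.mean(celsius), statistics.stdev(celsius)
mean_f, sd_f = statistics.mean(fahrenheit), statistics.stdev(fahrenheit)

# The summaries transform the same way the data do: mean -> 1.8*mean + 32,
# standard deviation -> 1.8*sd (the shift by 32 drops out of the deviations).
print(abs(mean_f - (1.8 * mean_c + 32)) < 1e-9)  # True
print(abs(sd_f - 1.8 * sd_c) < 1e-9)             # True
```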
1.3.7 Other Descriptive Statistics for Dispersion
Figure 1.18 illustrates the asymmetrical distribution of the WBC data shown earlier in Figs. 1.11, 1.12, and 1.13. The mean, median, and mode are not very close to each other. What would be the best description of the dispersion? The calculated standard deviation is 2186, which could be interpreted to mean that about the middle two-thirds of the data
Fig. 1.18 An asymmetrical distribution depicted by stem-and-leaf plot
are between 2724 and 7096 (i.e., within the mean ± 1 standard deviation) if the contour of the distribution had a bell-like shape. Because the distribution is not symmetrical, the interquartile range may describe the dispersion better than the standard deviation. The 25th and 75th percentiles (i.e., the first and third quartiles) are 3400 and 6100, respectively, which tells us literally that half of the group lies within this range, whose width is 2700 (i.e., interquartile range = 6100 − 3400 = 2700).
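A quartile-based summary like this can be computed with Python's statistics module. The data below are hypothetical, and the exact cut points depend on the interpolation method the library uses.

```python
import statistics

data = [3100, 3400, 4100, 4900, 5200, 6100, 6500, 9000]  # hypothetical values
q1, q2, q3 = statistics.quantiles(data, n=4)  # first quartile, median, third quartile
iqr = q3 - q1                                 # interquartile range
print(q1, q2, q3, iqr)
```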
1.3.8 Dispersions Among Multiple Data Sets
Figure 1.19 presents two data sets of the same measurement variable in two separate groups of individuals. The two group-specific means are the same, but the dispersion of the first group is twice that of the second group. The difference in the dispersions is not only visible but is also reflected in the standard deviations of 10 and 5.
The comparison of dispersions may become less straightforward in certain situations. What if the two distributions represent either the same characteristic (e.g., body temperature) in two distinct groups, or different characteristics measured in the same unit on the same individuals (e.g., fat mass and lean mass measured in grams, or systolic blood pressure (SBP) and diastolic blood pressure (DBP) measured in mmHg)? In Fig. 1.20, can we say the SBP values are more dispersed than the DBP values solely by reading the two standard
Fig. 1.19 Two data sets with unequal dispersions and equal means
Fig. 1.20 Two data sets with unequal dispersions and unequal means
Table 1.2 Application of CV to compare the dispersions of two different characteristics, measured in the same unit, of the same individuals

                     N     Mean       Standard deviation   CV
Body fat mass (g)    160   19783.28   8095.68              40.9%
Body lean mass (g)   160   57798.63   8163.56              14.1%

deviations? Although the standard deviation of the SBP distribution is greater than that of the DBP distribution, the mean SBP is also greater, so interpreting the standard deviations needs to take the magnitudes of the two means into account. The coefficient of variation (CV) is a descriptive statistic applicable in such a circumstance: it converts the standard deviation into a universally comparable descriptive statistic.
The CV is defined as the standard-deviation-to-mean ratio expressed on a percent scale (i.e., CV = 100 × standard deviation/mean). This statistic is useful for comparing the dispersions of two or more distributions of the same variable across data sets whose means are not identical, or of two or more different variables measured in the same unit within the same data set. Table 1.2 demonstrates the situation of comparing the dispersions of two different characteristics measured from the same
Table 1.3 Application of CV to compare the dispersions of the same characteristics, measured in the same unit, of two distinct groups

Body fat mass (g)   N    Mean       Standard deviation   CV
Group 1             80   21118.04   8025.78              38.0%
Group 2             80   18448.53   7993.01              43.3%

individuals in the same unit. The standard deviation of the fat mass in grams is smaller than that of the lean mass in grams for the same 160 individuals, but the CV of the fat mass is greater, indicating that the fat mass distribution is more dispersed (CV 40.9% compared to 14.1%).
Table 1.3 compares the dispersions of the same characteristic measured in two distinct groups. The standard deviation is greater in group 1, but the CV is greater in group 2, indicating that the dispersion of fat mass relative to its mean is greater in group 2.
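The CV calculation itself is one line. The sketch below reproduces the Table 1.2 values from the reported means and standard deviations; no individual-level data are needed.

```python
def cv(mean, sd):
    # Coefficient of variation: standard deviation relative to the mean, in percent.
    return 100.0 * sd / mean

fat_cv = cv(19783.28, 8095.68)    # body fat mass (Table 1.2)
lean_cv = cv(57798.63, 8163.56)   # body lean mass (Table 1.2)
print(round(fat_cv, 1), round(lean_cv, 1))  # 40.9 14.1
```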
1.3.9 Caution to CV Interpretation
The CV is a useful descriptive statistic for comparing the dispersions of two or more data sets when their means differ. However, the CV should be applied carefully. When the dispersions of two distributions are compared, we need to ensure that the comparison is appropriate. A comparison of dispersions of the same or compatible kinds is appropriate (e.g., CVs of body weights obtained from two separate groups, or CVs of SBP and DBP obtained from the same group of persons). However, a comparison of two dispersions of which one results from a transformation of the original data is not appropriate. For example, in the body temperature example of Sect. 1.3.6, the CV of the original °C data is 100 × (0.62/36.83) = 1.68%, and the CV of the data transformed via °C − 36.5 is 100 × (0.62/0.33) = 187.88%. Did the dispersion really increase that much after a simple shift of the whole distribution? No, the dispersion did not change, and the standard deviations remained the same; nevertheless, the CV of the shifted data distribution differs from that of the original °C scale.
1.4 Statistics for Describing Relationships Between Two Outcomes
1.4.1 Linear Correlation Between Two Continuous Outcomes
Previous sections discussed how to summarize the data observed from a single variable (univariate). This section discusses how to describe a relationship between a set of pairs of continuous outcomes (e.g., a collection of heights measured from
Fig. 1.21 Linear relationships between two continuous outcomes
Fig. 1.22 Geometry of correlation coefficient
biological mother and her daughter pairs). The easiest way to describe such a pattern is to create a scatter plot of the paired data (Fig. 1.21). The correlation coefficient, ρ, is a descriptive statistic that summarizes the direction and strength of a linear association. The correlation coefficient lies between −1 and 1 (the geometry of the correlation coefficient is demonstrated in Fig. 1.22). Negative ρ values indicate a reverse linear association between the paired variables, and positive ρ values indicate a same-directional linear association. For example, ρxy = −0.9 indicates a strong negative linear association between x and y, and ρxy = 0.2 indicates a weak positive linear association. Note that the correlation coefficient measures only the linear association. Figure 1.23 illustrates a situation in which the correlation coefficient is 0 but there is a clear relationship between the paired variables. The definitional expression is:
Fig. 1.23 Nonlinear relationship between two continuous outcomes
ρ = Σ(xi − x̄)(yi − ȳ) / [(n − 1)·sx·sy],

using the sample pairs (xi, yi), for i = 1, 2, . . ., n, where sx and sy are the sample standard deviations of x and y. The denominator (n − 1)·sx·sy can be rewritten as √[Σ(xi − x̄)²·Σ(yi − ȳ)²], and finally:

ρ = Σ(xi − x̄)(yi − ȳ) / √[Σ(xi − x̄)²·Σ(yi − ȳ)²],

where x̄ and ȳ are the means of x and y, respectively. This calculation can be cumbersome if done manually, but computer software is widely available, and Excel can also be used (see Chap. 7 for details).
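As a sketch, the definitional formula can be coded directly; the paired values below are hypothetical.

```python
import math

def correlation(xs, ys):
    # Definitional formula: sum of cross-deviations over the square root of
    # the product of the two sums of squared deviations.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) *
                    sum((y - my) ** 2 for y in ys))
    return num / den

xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical paired data
ys = [2.1, 3.9, 6.2, 8.0, 9.8]
print(correlation(xs, ys))        # close to 1: strong positive linear association
```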
1.4.2 Contingency Table to Describe an Association Between Two Categorical Outcomes
Qualitative categorical outcomes cannot be summarized by the mean and standard deviation of the observed categories, even if the categories are numerically coded (i.e., the mean of such codified data is meaningless). Likewise, an association between a pair of numerically coded categorical outcomes cannot be assessed by the correlation coefficient, because the calculation of the correlation coefficient involves the mean values and the deviations from them (see Fig. 1.22). A scatter plot is also not applicable for visually describing a pair of categorical outcomes. To describe the pattern of a set of pairs obtained from two categorical outcomes, a contingency table is used (Fig. 1.24, where each cell number is the observed frequency of study subjects). The number appearing in each cell (i.e., the cell frequency) provides information about the association between the two categorical variables. Figure 1.24 illustrates a perfect association, a moderate association, and a complete absence of association between a disease status and a deleterious risk factor. Figure 1.25 illustrates what data pattern is to be recognized for a summary interpretation. Twenty percent (i.e., 10 out of 50) of the mothers who are ≤20 years old delivered low-weight babies, whereas only 10% (i.e., 15 out of
Fig. 1.24 Patterns of association between two binary outcomes
Fig. 1.25 Data summary by a contingency table
150) of the mothers >20 years old did so. It is also noted that the 20% is greater (and the 10% is smaller) than the marginal proportion of babies ≤2500 grams (i.e., 12.5%). This observed pattern is interpreted as a twofold difference between the two mother groups in the proportion of babies weighing ≤2500 grams.
1.4.3 Odds Ratio
The odds ratio (OR) is a descriptive statistic that measures the direction and strength of an association between two binary outcomes. It is defined as a ratio of two odds. The odds is the ratio of the probability of observing the event of interest, π, to the probability of not observing it, 1 − π, i.e., odds = π/(1 − π). In practical applications, the odds can be calculated simply as the ratio between the number of events of interest and the number of events not of interest (e.g., the number of successes divided by the number of failures). The odds ratio associated with the presence versus the absence of a risk factor for the outcome of interest is thus defined as:
OR = [π1/(1 − π1)] / [π2/(1 − π2)],

where π1 and π2 are the event probabilities in the presence and absence of the risk factor, respectively.
The odds ratio ranges from 0 to infinity. A value 0 < OR < 1 indicates a protective effect of the factor (in a setting of association between a risk factor and a disease outcome), i.e., the outcome is less likely to happen within the risk group; OR = 1 is neutral; and OR > 1 indicates a deleterious effect of the risk factor, i.e., the outcome is more likely to happen within the risk group. According to the definition, the odds ratio associated with a mother's age ≤20 years versus >20 years for an offspring's birth weight ≤2500 grams is [0.2/(1 − 0.2)]/[0.1/(1 − 0.1)] = 2.25. The same result is obtained simply by the cross-product ratio, i.e., (10/40)/(15/135) = (10 × 135)/(40 × 15) = 2.25. The interpretation is that the odds of delivering an offspring with a birth weight ≤2500 grams among mothers aged ≤20 years are 2.25 times the odds among mothers aged >20 years. It is a common mistake to make the erroneous interpretation that the risk of a low-birth-weight delivery is 2.25 times greater; that would interpret the odds ratio as if it were a risk ratio. Note that a risk is the probability of observing an event of interest, whereas an odds is the ratio of the probability of observing the event of interest to the probability of not observing it.
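The cross-product calculation for the 2 × 2 table above can be sketched as follows; the four cell counts come from Fig. 1.25.

```python
def odds_ratio(a, b, c, d):
    # a, b: events and non-events in group 1 (mothers <= 20 years: 10 vs 40);
    # c, d: events and non-events in group 2 (mothers > 20 years: 15 vs 135).
    return (a / b) / (c / d)   # equivalent to the cross-product (a*d)/(b*c)

print(round(odds_ratio(10, 40, 15, 135), 2))  # 2.25
```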
1.5 Two Essential Probability Distributions
The remaining part of this chapter concentrates on two probability models that are essential and instrumental to statistical inference: the Gaussian model and the binomial model. Let us go over some definitions first. A distribution is a complete description of a set of data, which specifies the domain of data occurrences and the corresponding relative frequency over that domain. Note that the object being distributed is the relative frequency. A probability model (e.g., the Gaussian model, the binomial model, etc.) is the underlying mathematical rule (i.e., mechanism) governing event occurrence (i.e., the observed data).
The probability model is describable by means of a random variable and its distribution function. A random variable is a variable that maps an event onto its probability of occurrence. Uppercase letters are used to denote random variables. Let us briefly review the random variable through a small experiment. The experiment is to toss a biased coin twice in a row, where at each toss the chance of landing "heads" is 0.7. Let X denote the number of heads (note that an uppercase letter is used to denote a random variable). Then X can be 0, 1, or 2. Denoting by P(X = x) the probability of observing x heads: P(X = 0) = 0.3 × 0.3 = 0.09 (i.e., TT), P(X = 1) = 0.3 × 0.7 + 0.7 × 0.3 = 0.42 (i.e., HT or TH), and P(X = 2) = 0.7 × 0.7 = 0.49 (i.e., HH). Note that the variable X connects each event (i.e., total number of heads = 0, 1, 2) with its corresponding probability (0.09, 0.42, 0.49). Such a variable is called a random variable.

Fig. 1.26 Probability mass function of a discrete random variable

Fig. 1.27 Cumulative distribution function of a discrete random variable

For a discrete random variable x (i.e., a variable whose values can only be integers, as in this example) with its corresponding probabilities of occurrence, the set of ordered pairs (xi, f(xi)) for all i is called a probability mass function (Fig. 1.26).
A cumulative distribution function (in short, a distribution function) of a random variable, denoted by F(x), is a function that gives, for each value of the random variable (lowercase letters are used to denote the specific values of the random variable), the probability that a newly observed value of the random variable falls at or below that value. In the case of the biased-coin example above, F(0) = 0.09, F(1) = 0.09 + 0.42 = 0.51, and F(2) = 0.09 + 0.42 + 0.49 = 1 (Fig. 1.27). For a continuous random variable X, a probability density function of X (in short, a density function), f(x), is a nonnegative function whose total area under its graph is 1, and the area between two numbers (e.g., a ≤ x < b) is the probability, denoted by Pr(a ≤ X < b), that a value of the random variable is observed in this interval. In the case of the discrete random variable above, the density function appears as a three-point function (the probability mass function of Fig. 1.26) because there are only three possible values of the random variable, i.e., x = 0, 1, or 2.
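The biased-coin example can be enumerated directly; the sketch below rebuilds the probability mass function and the cumulative distribution function from the two-toss outcomes.

```python
from itertools import product

P_HEADS = 0.7
pmf = {0: 0.0, 1: 0.0, 2: 0.0}
for tosses in product("HT", repeat=2):          # TT, TH, HT, HH
    prob = 1.0
    for t in tosses:
        prob *= P_HEADS if t == "H" else 1 - P_HEADS
    pmf[tosses.count("H")] += prob              # accumulate by number of heads

cdf = {x: sum(pmf[k] for k in range(x + 1)) for x in pmf}
for x in (0, 1, 2):
    print(x, round(pmf[x], 2), round(cdf[x], 2))  # pmf 0.09/0.42/0.49; cdf 0.09/0.51/1.0
```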
The mean and variance of a continuous random variable X with density f(x) are

μ = ∫ x·f(x) dx and σ²X = ∫ (x − μ)²·f(x) dx,

where the integrals run from −∞ to ∞. The covariance between two random variables X and Y ~ fxy(x, y) is

σxy = ∫∫ (x − μx)·(y − μy)·fxy(x, y) dx dy,

where fxy(x, y) is the density function of the joint distribution of X and Y.
The variance of a sum of two correlated random variables X and Y (with correlation coefficient ρ) can be decomposed as the variance of X plus the variance of Y plus twice the covariance of (X, Y); its algebraic expression is σ²x+y = σ²x + σ²y + 2σxy = σ²x + σ²y + 2ρσxσy.
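This decomposition can be checked numerically on any paired data set; the values below are hypothetical, and population-style divisors (n) are used for simplicity.

```python
# Numerical check of var(X+Y) = var(X) + var(Y) + 2*cov(X, Y).
def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)

def cov(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 4.0, 7.0]   # hypothetical paired data
ys = [2.0, 1.0, 5.0, 4.0]
sums = [x + y for x, y in zip(xs, ys)]
lhs = var(sums)
rhs = var(xs) + var(ys) + 2 * cov(xs, ys)
print(abs(lhs - rhs) < 1e-9)  # True: the identity holds
```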
1.5.1 Gaussian Distribution
The Gaussian distribution describes the data generation mechanism of continuous outcomes, and it has important mathematical properties on which event-probability computations and inference (see Chap. 2) rely. The name Gaussian refers to the mathematician Carl Friedrich Gauss, who developed its mathematical model. It is also called the normal distribution because the model describes well the probability distributions of typically behaving continuous outcomes (the bell curve). This distribution has the characteristics that the mean, median, and mode are identical and that the data aggregate largely around the central location and spread out gradually and symmetrically. A Gaussian distribution is completely characterized by its mean and standard deviation, and its notation is N(μ, σ²), where μ and σ denote the mean and standard deviation, respectively (σ² denotes the variance).
1.5.2 Probability Density Function of Gaussian Distribution
Probability density is the quantity concentrated at a particular value within the possible data range of a continuous outcome, and this quantity is proportional to the probability of occurrence within a neighborhood of that value. The mathematical expression is:
f(x) = [1/√(2πσ²)]·e^(−(x − μ)²/(2σ²)), for −∞ < x < ∞.
The height of the symmetric bell curve is the size of the density (not the actual probability) concentrated over the values of the continuous outcome random variable x. The value of x where the density peaks (the central location) and the dispersion (i.e., how the loaded density spreads out) are completely determined by the mean and the standard deviation of the distribution, respectively. The area under the density curve from one standard deviation below the mean to one standard deviation above the mean, i.e., ∫[μ−σ, μ+σ] f(x) dx, is approximately 68.3%, meaning that a little over the middle two-thirds of the group is aggregated symmetrically within one standard deviation around the mean of any Gaussian distribution. The area under the entire density
Fig. 1.28 Area under a Gaussian density curve within a closed interval
Fig. 1.29 Evaluation of an upper tail probability (a Gaussian density curve example)
curve is 1, i.e., ∫[−∞, ∞] f(x) dx = 1. Figure 1.28 describes the probability that a Gaussian random variable x ~ N(μ, σ²) will be observed between its mean and k times the standard deviation above the mean, i.e., μ ≤ x ≤ μ + kσ, which is evaluated as the area under the density curve over this interval, i.e., ∫[μ, μ+kσ] f(x) dx.
1.5.3 Application of Gaussian Distribution
The Gaussian (normal) distribution model is a very useful tool for approximate calculation of the probability of observing a certain numerical range of events. The example shown in Fig. 1.29 is to find the proportion of a large group of pediatric subjects whose serum cholesterol level is above 250 mg/mL, if the group's cholesterol distribution follows a normal distribution with a mean of 175 and a standard deviation of 30. Because the standard deviation is 30, the value 250 is 2.5 standard deviations above the mean (i.e., 250 = 175 + 2.5 × 30). The area under the curve covering the cholesterol range >250 is 0.62%, which indicates that the subjects with a cholesterol level >250 are within the top 0.62% of the group. The calculation requires integrating the Gaussian density function. However, we can obtain the result using Excel or standard probability tables of the Gaussian distribution. The next section discusses how to calculate the probability using the tables by transforming any Gaussian distribution to the standard normal distribution.
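Without tables, the same tail area can be obtained from the standard-normal cumulative distribution function, which Python exposes through math.erf; the sketch below reproduces the cholesterol example.

```python
import math

def normal_upper_tail(x, mu, sigma):
    # Standardize to z, then use Phi(z) = 0.5 * (1 + erf(z / sqrt(2))).
    z = (x - mu) / sigma
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))

p = normal_upper_tail(250, 175, 30)   # z = (250 - 175) / 30 = 2.5
print(round(p, 4))                    # 0.0062
```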
1.5.4 Standard Normal Distribution
The standard normal distribution is the Gaussian distribution whose mean is 0 and standard deviation is 1, i.e., N(0, 1). A random variable x following a Gaussian distribution can be standardized by the following transformation, in which x represents a value from the original Gaussian distribution with mean μ and standard deviation σ, and z represents the new random variable resulting from the transformation:
z = (x − μ)/σ.
This transformation shifts the entire data set uniformly by subtracting μ from all individual values and then rescales the shifted values by dividing them by the standard deviation; thus, the transformed data have mean 0 and standard deviation 1. The standard normal distribution has several useful characteristics on which statistical inference relies (this will be discussed in Chap. 2). First, as shown above, the density is symmetrically distributed over the data range, resembling a bell-like shape. Moreover, the interval from one standard deviation below to one standard deviation above the mean, i.e., from −1 to 1 on z, covers approximately 68.3% of the distribution symmetrically. The interval of z from −2 to 2 (i.e., within two standard-deviation units symmetrically around the mean) covers approximately 95.5% of the distribution. The normal range from −1.96 to 1.96 on z, which covers 95% of the distribution around the mean, is frequently sought (Fig. 1.30).
Figures 1.31 and 1.32 illustrate the use of Table 10.1 to compute the probability evaluated within a certain interval without using a computer program. For example, let us calculate the probability that an observed value x will be 250 or greater if its occurrence mechanism follows a Gaussian probability model with a mean (μ) of 175 and a standard deviation (σ) of 30. The probability is evaluated via the z = (x − μ)/σ transformation, i.e., Pr(x ≥ 250) = Pr(z ≥ (250 − 175)/30) = Pr(z ≥ 2.5) = 0.0062.
Fig. 1.30 Proportion of standard normal distribution covered by 1 (and 1.96) unit of standard deviation above and below the mean
Fig. 1.31 List of selected normal random variates and cumulative probabilities up to those values
Fig. 1.32 Standardization of an observed value x = 250 from N (μ = 175, σ2 = 302) to z = 2.5 of standardized normal distribution, i.e., N (0, 1)
1.5.5 Binomial Distribution
The probability values distributed over the finite counts of dichotomous event outcomes (e.g., success or failure) are typically modeled by the binomial distribution. For demonstration purposes, let us consider the following situation. Suppose it is known that a new investigative therapy can reduce the volume of a certain type of tumor significantly and that the average success rate is 60%. What is the probability of observing four or more successful outcomes (i.e., significant tumor volume reduction) in a small experiment treating five animals with such a tumor, if the 60% average success rate is true? First, calculate the probabilities of all possible outcomes under this assumption, i.e., no successes, 1 success, 2, 3, 4, or all 5 successes, given that the true average success rate is 60%. Note that a particular subject's single
result should not alter the next subject's result, i.e., the resulting outcomes are independent across the experimental animals. In this circumstance, the probability distributed to the single dichotomous outcome of each animal (a shrunken tumor as the success, or no response as the failure) is characterized by the Bernoulli distribution with its parameter π, the probability of success in a single animal treatment (i.e., the two probabilities are π, the success rate, and 1 − π, the failure rate). The single trial, in this case each treatment given to each animal, is called a Bernoulli trial. The resulting probability distribution of the total number of successes out of the five independent treatments (i.e., five independent Bernoulli trials) is then described by the binomial distribution, which is characterized by two parameters: the first is the total number of Bernoulli trials, n, and the second is the Bernoulli distribution's success-rate parameter, π. In this example, the total number of independent trials n is 5, and the success-rate parameter π of each single-trial Bernoulli distribution is 0.6. Table 1.4 lists all possible results and their probabilities (0 = failure with a single-occurrence chance of 0.4, 1 = success with a single-occurrence chance of 0.6). As shown in the last column of the table, these calculated probabilities are 0.0102 for 0 successes (i.e., all failures, whose probability is 0.4 × 0.4 × 0.4 × 0.4 × 0.4 = 0.0102), 0.0768 for 1 success, 0.2304 for 2 successes, 0.3456 for 3 successes, 0.2592 for 4 successes, and 0.0778 for all 5 successes. The general notation of a binomial distribution is X ~ Bi(n, π); in this example, letting X denote the number of successes, the notation is X ~ Bi(5, 0.6). Note also that the Bernoulli distribution is a special case of the binomial distribution, with general notation Bi(1, π). Figure 1.33 displays Bi(5, 0.6). Thus, the probability of observing four or more successes out of the treatments given to five independent animals is 0.2592 + 0.0778 = 0.3370. The probability of X = x under Bi(n, π) is Pr(X = x successes out of n independent Bernoulli trials) = K·π^x·(1 − π)^(n − x), where K = (n choose x) is the integer multiplier counting all possible assortments of the x success events (x = 0, 1, . . ., n). Readers familiar with combinatorics can easily figure out that (n choose x) = n!/[x!(n − x)!]. In Table 1.4 (n = 5), K = 1 for x = 0, K = 5 for x = 1, K = 10 for x = 2, . . ., and K = 1 for x = 5. It is straightforward that the expression for Bi(1, π) is Pr(X = x out of 1 Bernoulli trial) = π^x·(1 − π)^(1 − x), where x = 1 (success) or 0 (failure).
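The Bi(5, 0.6) probabilities of Table 1.4 can be reproduced directly from the pmf formula above:

```python
from math import comb

def binom_pmf(x, n, pi):
    # Pr(X = x) = C(n, x) * pi^x * (1 - pi)^(n - x)
    return comb(n, x) * pi**x * (1 - pi) ** (n - x)

n, pi = 5, 0.6
pmf = [binom_pmf(x, n, pi) for x in range(n + 1)]
print([round(p, 4) for p in pmf])  # [0.0102, 0.0768, 0.2304, 0.3456, 0.2592, 0.0778]
print(round(pmf[4] + pmf[5], 4))   # 0.337  (probability of four or more successes)
```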
While the binomial distribution fits the probability of success counts arising from a fixed number of independent trials, when the event of interest is not rare (i.e., π is not very small) and the size of the trial, n, becomes large, the probability calculation for a range of the number of success events can be conveniently approximated using the Gaussian distribution, even though the number of successes is not continuous. Figure 1.34 demonstrates the rationale for such an application. In general, for n × π ≥ 5 (i.e., the expected number of successes is at least 5), if n becomes large for a given π, or π becomes large for a given n, then the distributed probability pattern of the binomial distribution becomes closer to N(μ = n × π, σ² = n × π × (1 − π)).
|
||
|
||
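As a quick numerical check (an added sketch, not part of the original text), the formula Kπ^x(1 − π)^(n − x) can be evaluated directly in plain Python for n = 5 and π = 0.6; it reproduces the subtotals listed in Table 1.4.

```python
from math import comb

def binom_pmf(x, n, p):
    # K * p^x * (1 - p)^(n - x), with K = (n choose x) = n!/[x!(n - x)!]
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

pmf = {x: binom_pmf(x, 5, 0.6) for x in range(6)}
for x, prob in pmf.items():
    print(x, round(prob, 4))          # 0.0102, 0.0768, 0.2304, 0.3456, 0.2592, 0.0778

# Four or more successes out of five independent trials
print(round(pmf[4] + pmf[5], 4))      # 0.337, matching the 0.3370 above
```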
Table 1.4 Binomial distribution with n = 5 and π = 0.6

x (successes)  K = (5 choose x)  Events of Bernoulli trials (1st-5th; 1 = success, 0 = failure)
0              1                 00000
                                 each 0.6^0 × 0.4^5 (subtotal = 0.0102)
1              5                 10000, 01000, 00100, 00010, 00001
                                 each 0.6^1 × 0.4^4 (subtotal = 0.0768)
2              10                11000, 10100, 10010, 10001, 01100, 01010, 01001, 00110, 00101, 00011
                                 each 0.6^2 × 0.4^3 (subtotal = 0.2304)
3              10                00111, 01011, 01101, 01110, 10011, 10101, 10110, 11001, 11010, 11100
                                 each 0.6^3 × 0.4^2 (subtotal = 0.3456)
4              5                 11110, 11101, 11011, 10111, 01111
                                 each 0.6^4 × 0.4^1 (subtotal = 0.2592)
5              1                 11111
                                 0.6^5 × 0.4^0 (subtotal = 0.0778)
Fig. 1.33 Distribution of Bi(n = 5, π = 0.6)

Fig. 1.34 Large sample behavior of binomial distributions with various trial sizes and success rates
Suppose that we now increase the number of animal experiments to 100, and we want to compute the probability of observing 50 to 75 successes arising from 100 independent trials. Because n × π = 100 × 0.6 = 60 and n × π × (1 − π) = 100 × 0.6 × 0.4 = 24, this task can be handled by the normal approximation, for which the distribution used is N(μ = 60, σ² = 24). Then, as depicted in Fig. 1.35, the first step is to transform the interval from 50 to 75 on N(μ = 60, σ² = 24) to a new interval on N(0, 1). Letting x denote the number of successes, according to the z = (x − μ)/σ transformation, x = 50 is transformed to z = (50 − μ)/σ = (50 − 60)/√24 = −2.04, and x = 75 to z = (75 − μ)/σ = (75 − 60)/√24 = 3.06. So, the probability of observing between 50 and 75 successes is the area under the density curve of N(0, 1) between −2.04 and 3.06 on z, which is 0.98.
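As a numerical illustration (an added sketch, not from the text), the exact binomial probability and its normal approximation can be compared using only the Python standard library; `phi` below is the standard normal CDF built from `math.erf`.

```python
from math import comb, erf, sqrt

def phi(z):
    # Standard normal CDF via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

n, p = 100, 0.6
mu, sigma = n * p, sqrt(n * p * (1 - p))   # 60 and sqrt(24)

# Exact binomial probability of observing 50 to 75 successes
exact = sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(50, 76))

# Normal approximation: transform the interval endpoints to z-scores
z_lo, z_hi = (50 - mu) / sigma, (75 - mu) / sigma   # about -2.04 and 3.06
approx = phi(z_hi) - phi(z_lo)

print(round(exact, 4), round(approx, 4))   # both close to 0.98
```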
Fig. 1.35 Normal approximation to calculate the probability of observing 50-75 successes arising from 100 independent trials wherein the single Bernoulli trial's success rate is 0.6

Fig. 1.36 Distribution (probability mass function) of Bi(n = 30, π = 0.01)

On the other hand, when the event of interest is rare and the size of the trial becomes very large, the computation can be approximated by the Poisson model, in which the number of trials is no longer an important constant (i.e., parameter) that characterizes the distribution. The notation is Pois(λ), where λ denotes the average number of successes of the rare event out of a large number of independent trials. An exemplary outcome that is well characterized by the Poisson model is the number of automobile accidents on a particular day in a large metropolitan city. The rare events can be the ones whose binomial characteristic constants satisfy n × π < 5 (i.e., the expected number of successes is less than 5). The next example is a binomial distribution for which the probability calculation can be approximated by a Poisson distribution. Figure 1.36 displays the probabilities of observing 0, 1, 2, . . ., 30 adverse events among 30 independent clinical trials of a new drug if the true adverse event rate = 0.01 (i.e., 1%). The typical pattern of a Poisson distribution is that the probability value decreases exponentially after a certain number of successes, and as the expected number of successes, n × π, becomes smaller, the value decreases faster. If we let a computer calculate the probability of observing three or more adverse events out of 30 trials, then the result will be 0.0033. If we approximate this distribution by Poisson (λ = 30 × 0.01 = 0.3) and let a computer calculate such an event, the result will be 0.0035, which is not much different from the binomial model-based calculation.
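The agreement between the two models can be recomputed directly (an added sketch using only the Python standard library; results agree with the figures quoted above to within rounding).

```python
from math import comb, exp, factorial

n, p = 30, 0.01
lam = n * p   # expected number of adverse events: 0.3

# P(X >= 3) under the exact binomial model Bi(30, 0.01)
binom_tail = 1 - sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(3))

# P(X >= 3) under the approximating Poisson(0.3) model
pois_tail = 1 - sum(exp(-lam) * lam ** x / factorial(x) for x in range(3))

print(round(binom_tail, 4), round(pois_tail, 4))   # 0.0033 and 0.0036
```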
Study Questions

1. What are the similarity and dissimilarity between the interval scale and the ratio scale?
2. What is the definition of a distribution? What are being distributed in the distribution?
3. In a box-and-whisker plot, what proportion of the population is contained in the "box" interval? Is such a plot useful to describe a bimodal (i.e., two modes) distribution?
4. Please explain the definition of standard deviation.
5. What proportion of the data values are within one standard deviation above and below the mean if the data are normally distributed?
6. Can a correlation coefficient measure the strength of linear and nonlinear relationships between two continuous observations?
7. What are the definitions of odds and odds ratio?
8. What are the two parameters that completely characterize a Gaussian distribution?
9. What are the two parameters that completely characterize a binomial distribution?
10. Under what condition can a Gaussian model approximate the proportion of a population that lies within a certain range of number of events describable by a binomial model?
Bibliography

Grimmett, Geoffrey; Stirzaker, David (2001). Probability and Random Processes (3rd Edition). Oxford University Press.
Johnson, Norman L.; Kotz, Samuel; Balakrishnan, N. (1994). Continuous Univariate Distributions, Vol. 1 (2nd Edition). John Wiley, New York.
Ross, Sheldon (2010). A First Course in Probability (8th Edition). Pearson Prentice Hall.
Snedecor, George W.; Cochran, William G. (1991). Statistical Methods (8th Edition). Wiley-Blackwell.
Tukey, John W. (1977). Exploratory Data Analysis. Addison-Wesley.
Chapter 2
Statistical Inference Concentrating on a Single Mean

Statistical inference is to infer the population characteristics of interest through the observed sample data. If the whole collection of the population data were observed, then there would be no room for uncertainty about the true knowledge of the population, statistical inference would be unnecessary, and the data analysis would be totally descriptive. For a real-world investigation, a sample data set smaller than the whole population is gathered for inference. Since the sample data set does not cover the entire population and is not identical to it, inference using the sample data set becomes necessary. This chapter discusses the relationship between the population and the sample by addressing (1) the uncertainty and errors in the sample, (2) underpinnings that are necessary for a sound understanding of the applied methods of statistical inference, (3) areas and paradigms of drawing inference, and (4) good study design to minimize/avoid inherent errors in the sample.
2.1 Population and Sample

2.1.1 Sampling and Non-sampling Errors

Let us first discuss, before addressing statistical inference, several important phenomena and statistical concepts arising from using sample data. Suppose that our objective is to discover the average (i.e., mean) body weight of a large population (N > one million) of men and women. It is impractical to measure every individual in the population. Nonetheless, one can probably investigate by using a sufficiently large yet manageable size, n, of a well-represented random sample.

Let us assume that a statistician helped you determine the study sample size, n = 1000 individuals, whose sample mean is to serve as a good estimate of the population's mean body weight. The following are two major sources of uncertainty involved in the sample. The first is the sampling error. These 1000 subjects were
randomly drawn with an equal chance to be selected. If another separate random sample set of 1000 subjects is taken, then these new 1000 randomly drawn subjects will not be exactly the same individuals as the initial 1000 subjects. The mean values of these two sets will differ from each other, thus both means will differ from the population mean. A discrepancy between the sample means and the true population mean explained solely by this supposition is understood as a sampling error. This sampling error will eventually disappear as the sample size becomes very large. One extreme example is that if the sample consisted of the entire population, then there is no such error.
The next is non-sampling error. No matter how well the randomly drawn sample subjects represent the whole population, it is still possible that the sample mean and the true population mean are discrepant for other reasons. For instance, if an electronic scale that was used to measure weight systematically underestimated the true value (e.g., always read 0.1 lb less than the truth) because of a mechanical problem, then the discrepancy between the observed sample mean and the true population mean due to such machine error is understood as non-sampling error. This kind of error will never disappear even if the sample size becomes very large. While sampling error can be reduced by increasing the sample size, non-sampling error prevention requires a prudent sampling design (Fig. 2.1).
Fig. 2.1 Overview of inference using sample data
2.1.2 Sample Distribution and Sampling Distribution

The concepts of sample distribution and sampling distribution are the important bases of statistical inference. The following example demonstrates the sample and sampling distributions. We are interested in the average neutrophil counts per 100 white blood cells of healthy adults in a large population. Assume that the healthy adults' neutrophil counts per 100 white blood cells are normally distributed with a mean of 60 and standard deviation of 5 in this population. Let us also assume that there is no non-sampling error in this sampling.

A random sample of 30 individuals was taken from a large healthy adult population, and the sample mean was calculated. Would this sample mean be exactly 60? Moreover, if 20 research groups, including yours, had taken their random samples with the same sample size (i.e., n = 30) independently from the same population, then those sample means of yours and your colleagues would differ. Figure 2.2 A.1 illustrates how the histograms of these 20 individual sample sets would appear, and Fig. 2.2 A.2 illustrates how the 20 respective sample means would vary. A result from the same kind of experiment except for choosing a larger sample size (n = 300) is also demonstrated (Fig. 2.2 B.1 and B.2).

In Fig. 2.2 A.1, 20 sample distributions are described by histograms demonstrating the observed distribution of each random sample set with a sample size of 30 (the so-called sample distribution). Each of the other 20 histograms in Fig. 2.2 B.1 summarizes each random sample's observed distribution with an increased sample size of 300. Note that the sample distributions with the 10-fold larger sample size are more reflective of the population's distribution.

Figure 2.2 A.2 depicts the sampling distribution of the 20 sample means in Fig. 2.2 A.1, and Fig. 2.2 B.2 depicts that of the 20 sample means in Fig. 2.2 B.1. Note that the observed sampling distribution of the sample means drawn from the 20 repeated samples of size 300 (Fig. 2.2 B.2) is much less dispersed than that from the 20 repeated samples of size 30 (Fig. 2.2 A.2). The standard deviations of these two sampling distributions provide a very good sense of how the sample means vary over the repeated random samplings.
2.1.3 Standard Error

The standard deviation of a sampling distribution measures the variability of the sample means that vary over the independently repeated random samplings. This standard deviation is called the standard error of the sample mean, sx̄. Note that in real-life research, it is unrealistic to draw multiple samples in order to describe the sampling distribution as demonstrated in Sect. 2.1.2. The above example was to facilitate conceptualization of the sample and sampling distributions. The investigators draw only one sample set, and there is uncertainty about how far/close the obtained sample mean is from/to the true population mean (i.e., a presentation such as Fig. 2.2 cannot be made). There is a mathematical equation (not derived in this monograph) that allows the investigators to estimate the standard error of the mean based solely on the observed sample data set, which is simply to divide the sample standard deviation, s, by the square root of its sample size, n, i.e., s/√n. In the above illustration, a standard error of the sample mean can be estimated by directly applying s/√30 to any of the 20 sample distributions with sample size of 30, which would be very close to 0.67 (or 0.3112 for n = 300) with a small sample-to-sample variation. Such an estimated standard error is very close to the standard deviation of the 20 observed sample means from the experiment (such an experiment is called a Monte Carlo simulation). Table 2.1 summarizes the distinction between the sample and sampling distributions as well as the relationship between the sample standard deviation and the standard error.

Fig. 2.2 Sample distributions of random sampled data sets from a normal distribution and the sampling distribution of their sample means

Table 2.1 Sample- and sampling distributions

                                            Sample distribution              Sampling distribution
Distribution of                             Only one sample data set         Sample means obtained over multiple
                                                                             sets of random samples
As the sample size increases,               Becomes closer to the            Becomes symmetrical and narrower
the shape of the distribution               population distribution
Name (and notation) of the                  Sample standard deviation (s)    Standard error (SE) of the mean, sx̄
dispersion statistic
Relationship between s and sx̄               s = √n × sx̄                      sx̄ = s/√n
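The Monte Carlo idea described above can be sketched in a few lines (an added illustration, not the book's own code): repeatedly sample from a hypothetical N(mean = 60, sd = 5) population and compare the standard deviation of the resulting sample means with σ/√n.

```python
import random
import statistics

random.seed(1)

def sample_mean(n):
    # One random sample from a hypothetical N(mean=60, sd=5) population
    return statistics.mean(random.gauss(60, 5) for _ in range(n))

results = {}
for n in (30, 300):
    means = [sample_mean(n) for _ in range(2000)]   # 2000 repeated samplings
    observed_se = statistics.stdev(means)           # SD of the sampling distribution
    theoretical_se = 5 / n ** 0.5                   # sigma / sqrt(n)
    results[n] = (observed_se, theoretical_se)
    print(n, round(observed_se, 3), round(theoretical_se, 3))
```

The observed standard error shrinks by a factor of √10 when the sample size grows from 30 to 300, matching the s/√n rule.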
2.1.4 Sampling Methods and Sampling Variability of the Sample Means

What makes a sampling method unique is the sampling probability, i.e., the probability that an individual sampling unit is assigned to be sampled. For example, sampling methods can be classified into equal probability sampling and unequal probability sampling. Intuitively, random sampling with replacement (i.e., a sampling unit is sampled with equal probability) is equal probability sampling, and random sampling without replacement is unequal probability sampling. There can be many forms of unequal probability sampling methods. Each sampling method will offer a unique latitude of accuracy and precision of the estimator. This section discusses some useful relationships between the sampling probability and the accuracy and precision of the sample mean estimate, i.e., how the bias and sampling variance of the sample mean can be reduced by unequal probability sampling techniques.
Consider a small experiment in which you place four balls each marked with two-digit numbers 12, 23, 34, and 45 in an urn container. You draw a ball, read the number, return it to the container (i.e., random sampling with replacement), then draw a ball again and read the drawn number (i.e., sampling sequentially two balls with replacement, a random sampling with replacement with a sample size of 2), and then finally calculate the mean of the two observed numbers. Table 2.2 is the complete list of all possible 16 samples and their means of the ordered observed pairs (y(1), y(2)).

Table 2.2 Sampling two balls out of four in a jar with replacement and the calculated means of the observed ordered pairs

(y(1), y(2))   ȳ        (y(1), y(2))   ȳ
(12, 12)      12        (34, 12)      23
(12, 23)      17.5      (34, 23)      28.5
(12, 34)      23        (34, 34)      34
(12, 45)      28.5      (34, 45)      39.5
(23, 12)      17.5      (45, 12)      28.5
(23, 23)      23        (45, 23)      34
(23, 34)      28.5      (45, 34)      39.5
(23, 45)      34        (45, 45)      45
The mean of those 16 sample means is (12.00 + 17.50 + . . . + 45.00)/16 = 28.5, and it is the same as the population mean of the four numbers in the container, i.e., (12 + 23 + 34 + 45)/4 = 28.5. It is indeed expected that the mean of all possible 16 sample means observed through the above experiment must be the same as the true population mean. The arithmetic mean of those sample means over the whole sampling distribution is called the expectation (or expected value) of the sample mean. Likewise, the variance of the sample means can be calculated as the arithmetic mean of the squared deviations of every sample mean from the expectation, i.e., [(12 − 28.5)² + (17.5 − 28.5)² + . . . + (45 − 28.5)²]/16 = 75.625.

Now, let us consider another experiment, but this time let us increase the chance of 45 being sampled (this is possible if we duplicate this ball so that 12, 23, 34, 45, and an extra 45 are placed in the container). Then, let us conduct the experiment of sampling two balls with replacement again. With respect to the unique face value of the drawn ball, this experiment is an unequal chance sampling, in that the chances for the four unique values to be sampled are unequal, i.e., each draw will have a 20% chance to pick 12, 23, or 34, but a 40% chance to pick 45. Table 2.3 is the complete list of all possible 25 sample means of the ordered pairs (y(1), y(2)).

Table 2.3 Sampling two balls, with unequal chances, out of four in a jar with replacement and the calculated means of the observed ordered pairs

(y(1), y(2))   ȳ        (y(1), y(2))   ȳ
(12, 12)      12        (34, 45)      39.5
(12, 23)      17.5      (34, 45)      39.5
(12, 34)      23        (45, 12)      28.5
(12, 45)      28.5      (45, 23)      34
(12, 45)      28.5      (45, 34)      39.5
(23, 12)      17.5      (45, 45)      45
(23, 23)      23        (45, 45)      45
(23, 34)      28.5      (45, 12)      28.5
(23, 45)      34        (45, 23)      34
(23, 45)      34        (45, 34)      39.5
(34, 12)      23        (45, 45)      45
(34, 23)      28.5      (45, 45)      45
(34, 34)      34

This presentation can be modified as shown below (see Table 2.4), in that the events are combined commutatively (e.g., "12, 23" represents the two events (12, 23) and (23, 12)).

Table 2.4 Commutatively combined ordered pair events listed in Table 2.3

y1, y2    ȳ       Frequency
12, 23    17.5    2
12, 34    23      2
12, 45    28.5    4
23, 34    28.5    2
23, 45    34      4
34, 45    39.5    4
12, 12    12      1
23, 23    23      1
34, 34    34      1
45, 45    45      4

The arithmetic mean of all these 25 sample means is (17.5 × 2 + 23 × 2 + 28.5 × 4 + 28.5 × 2 + 34 × 4 + 39.5 × 4 + 12 × 1 + 23 × 1 + 34 × 1 + 45 × 4)/25 = 31.8. This result obviously differs from the result of the first experiment because 45 was drawn twice as frequently as the other numbers. The variance (i.e., the arithmetic mean of the squared deviations of the sample means from the expected value) of these means is [(17.5 − 31.8)² × 2 + (23 − 31.8)² × 2 + (28.5 − 31.8)² × 4 + (28.5 − 31.8)² × 2 + (34 − 31.8)² × 4 + (39.5 − 31.8)² × 4 + (12 − 31.8)² × 1 + (23 − 31.8)² × 1 + (34 − 31.8)² × 1 + (45 − 31.8)² × 4]/25 = 82.28.

Can we invent a method of sample mean estimation that would offer the sample mean as if the chance for 45 to be sampled were the same as that of the other numbers (i.e., 20%)? Because the chance for each unique number to be sampled is known (i.e., 20% for 12, 23, and 34, and 40% for 45), can the sample mean calculation of the two numbers at each draw be modified by assigning less weight to 45 than the weight assigned to 12, 23, or 34? One option is to assign to every drawn number a weight that is inversely proportional to the chance for that unit to be sampled (the so-called inverse probability weight, IPW). The new mean, ȳipw, is the inverse probability weighted mean of the two numbers while ignoring the drawing order. This new weighted mean is (y1/p1 + y2/p2)/2 instead of (y1 + y2)/2, where p1 and p2 are the sampling probabilities for each draw. This calculation is intuitively straightforward but produces a weighted sum, not the mean. Thus, to obtain the weighted mean, the weighted sum must be multiplied by 0.25, which is the fair chance per draw from the original four balls (i.e., the chance to draw 12, 23, 34, or 45), so that ȳipw = 0.25 × (y1/p1 + y2/p2)/2 (see Table 2.5).

Table 2.5 Application of inverse probability weight to calculate the mean of each observed event

y1, y2    ȳipw      Frequency
12, 23    21.875    2
12, 34    28.75     2
12, 45    21.5625   4
23, 34    35.625    2
23, 45    28.4375   4
34, 45    35.3125   4
12, 12    15        1
23, 23    28.75     1
34, 34    42.5      1
45, 45    28.125    4
Note that the expected value of ȳipw is now 28.5, i.e., (21.875 × 2 + 28.75 × 2 + . . . + 28.125 × 4)/25 = 28.5, and this is the same as the expectation of the sample means demonstrated in Table 2.2. What about the variance of the IPW-based 25 sample means? It is [(21.875 − 28.5)² × 2 + (28.75 − 28.5)² × 2 + . . . + (28.125 − 28.5)² × 4]/25 = 37.86. In this example, the IPW estimate of the sample mean was unbiased (i.e., the expected mean was the same as that of the first experiment). Moreover, the sampling variance decreased as well. However, it is not always true that IPW can decrease the sampling variance.
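The two urn experiments can be replayed exhaustively in code (an added illustration): enumerate all 25 ordered draws from the five-ball container and compare the plain sample mean with the inverse probability weighted mean.

```python
from itertools import product

values = [12, 23, 34, 45, 45]               # the five balls of the second experiment
p = {12: 0.2, 23: 0.2, 34: 0.2, 45: 0.4}    # sampling probability per unique value

pairs = list(product(values, repeat=2))     # all 25 equally likely ordered draws
plain = [(y1 + y2) / 2 for y1, y2 in pairs]
ipw = [0.25 * (y1 / p[y1] + y2 / p[y2]) / 2 for y1, y2 in pairs]

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    m = mean(xs)
    return mean([(x - m) ** 2 for x in xs])

print(round(mean(plain), 2), round(var(plain), 2))   # biased: 31.8 and 82.28
print(round(mean(ipw), 2), round(var(ipw), 2))       # unbiased: 28.5 and 37.86
```

The IPW mean recovers the true population mean 28.5 and, in this particular example, also has a smaller sampling variance.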
2.2 Statistical Inference

In the remaining sections of this chapter, we will discuss the necessary data reduction procedures for inference. We will discuss two pillars of inference, of which one is hypothesis testing and the other is estimation. Then, we will also briefly discuss two paradigms of statistical inference, frequentist inference and Bayesian inference.
2.2.1 Data Reduction and Related Nomenclatures

The procedure of statistical inference can be viewed as an itinerant avenue that connects the sample to the population. With a given sample data set, for both hypothesis testing and estimation, the very first step along that avenue is to reduce the sample data set into several descriptive summary statistics (i.e., extract the summary statistics out of the data set). Such an intervening step of operation is called data reduction. A descriptive data analysis applied to the sample data for the purpose of making a statistical inference is a good example of data reduction.

A parameter is a measured characteristic of a population (e.g., mean age, mean blood pressure, proportion of women, etc.). A statistic is a measured characteristic as a function (i.e., processed form) of sample data (e.g., sample mean age, sample mean of blood pressure, sample proportion of women, etc.). Estimation is the procedure to learn the value of the population parameter of interest using the sample data. An estimator is a mathematical function of sample data that is used to estimate a parameter (e.g., x̄ = [x1 + x2 + . . . + xn]/n, where x1 + x2 + . . . + xn is the sum of all observed values of variable x and n is the sample size). Note that an estimator is also a statistic. An estimate is a particular observed value of the estimator (e.g., mean age estimate = 23 years, i.e., a resulting value from data reduction).
2.2.2 Central Limit Theorem

The central limit theorem (CLT) is one of the important theoretical bases for inference. It describes the typical phenomenon of sample means (probably the most common central tendency statistic) arising from random sampling.

The sampling experiments demonstrated below will help the readers comprehend the meaning and usefulness of the CLT. The next two experiments are slightly different from the ones illustrated in Sect. 2.1.2 in that the population distribution of these sampling experiments is a continuous non-Gaussian distribution. The population distribution from which the samples are drawn is a bimodal (i.e., two modes) distribution, which often characterizes a mixture of two subgroup distributions clustered around two different central locations (Fig. 2.3).

Fig. 2.3 Histogram of a bimodal distribution

Experiment 1: Draw 30 distinct random sample sets from the given population set with sample size = 25 and make 30 separate histograms for these individual sample distributions. Then make a histogram of the 30 sample means that are obtained from the individual sample distributions.

Experiment 2: Repeat the same kind of experiment by increasing the sample size to 100 and create the histograms the same way as in the previous experiment.

Fig. 2.4 Sample distributions of randomly sampled data sets from a non-normal distribution and the sampling distributions of their sample means

In Fig. 2.4, as expected, each sample distribution appeared similar to the population distribution, and those from the sample size of 100 resembled the original population distribution more closely. Notably, the sampling distribution of the sample means drawn from the sample size of 100 appeared unimodal (i.e., one mode) and symmetrical (i.e., bell-like) although the population distribution was bimodal. The dispersion of the sampling distribution for n = 100 decreased (standard error decreased from 3.3 to 1.9), which is the same phenomenon that was already observed in the previous experiment in Sect. 2.1.2 (i.e., decreased sampling error for the increased sample size).
Central Limit Theorem (CLT) For a random variable that follows any continuous distribution having a finite** mean = μ and standard deviation = σ, if random samples are repeatedly taken many (m) times independently, where each sample size nk = n (for k = 1, 2, . . ., m), then the sampling distribution of the m sample means x̄k (for k = 1, 2, . . ., m) will approach a normal distribution with mean = μ and standard deviation = σ/√n as each sample size n increases infinitely.

** The Cauchy distribution does not have a finite mean and variance. The sampling distribution of its sample means does not converge to a normal distribution as the sample size becomes large; it is still a Cauchy distribution.
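A small simulation illustrates the theorem with a bimodal population (an added sketch; the mixture parameters below are hypothetical, not the book's): the means of size-100 samples cluster symmetrically around the mixture mean even though the population itself has two modes.

```python
import random
import statistics

random.seed(7)

# Hypothetical bimodal population: a 50/50 mixture of N(40, 4) and N(70, 4)
def draw_one():
    return random.gauss(40, 4) if random.random() < 0.5 else random.gauss(70, 4)

mu = 55   # the mixture mean: 0.5 * 40 + 0.5 * 70

# Sampling distribution of the mean for 3000 repeated samples of size 100
means = [statistics.mean(draw_one() for _ in range(100)) for _ in range(3000)]

sd = statistics.stdev(means)
frac_within = sum(abs(m - mu) < 2 * sd for m in means) / len(means)

print(round(statistics.mean(means), 1))   # centered at the mixture mean
print(round(frac_within, 2))              # close to 0.95, as for a normal curve
```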
2.2.3 The t-Distribution

When the sample size, n, is not large, the sampling distribution of sample means x̄k arising from random sampling from a normally distributed population may not be well approximated by the normal distribution (it is more spread out than the normal distribution, and the CLT may not be fully applicable). Moreover, in many real-world research studies, the population standard deviation can be unknown. When the sample size is not large enough and the population standard deviation is also unknown, the random variable defined below:

t = (x̄ − μ)/(s/√n) = √n(x̄ − μ)/s,

where s is the sample standard deviation, will follow Gosset's Student t-distribution (simply called the t-distribution) with a parameter, the so-called degrees of freedom, df, which is not defined in the Gaussian distribution, wherein df = n − 1. The probability density curves of t-distributions have heavier tails than that of N(0, 1) when df is small (see Fig. 2.5), and they become very close to it as the df becomes large (see Fig. 2.6).
Fig. 2.5 Probability density curve of t(df = 5) and N(0, 1)

Fig. 2.6 t-distributions with small and large dfs versus N(0, 1)
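The heavier tails can be checked numerically (an added sketch, not from the text) by integrating the Student t density, written from the standard formula with `math.gamma`, against the N(0, 1) density.

```python
from math import exp, gamma, pi, sqrt

def t_pdf(t, df):
    # Student t density with df degrees of freedom (standard formula)
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def normal_pdf(z):
    return exp(-z * z / 2) / sqrt(2 * pi)

def tail_beyond(f, cut=2.0, upper=60.0, step=1e-3):
    # Two-sided tail mass P(|T| > cut) by the trapezoidal rule
    xs = [cut + i * step for i in range(int((upper - cut) / step) + 1)]
    area = sum((f(a) + f(b)) * step / 2 for a, b in zip(xs, xs[1:]))
    return 2 * area

t_tail = tail_beyond(lambda t: t_pdf(t, 5))
n_tail = tail_beyond(normal_pdf)
print(round(t_tail, 3), round(n_tail, 3))   # the df = 5 tail is more than twice as heavy
```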
In-depth understanding of the degrees of freedom requires ample knowledge of mathematical statistics and linear algebra, and the following is a simple explanation for the applied users. The degrees of freedom (df), whenever appears in this book, is understood as a parameter that characterizes a unique probability distribution (e.g., t, F, or χ2 will be discussed later). In practice, finding out df is necessary for inference. The upcoming chapters and sections will concentrate the discussion of the degrees of freedom only on the minimally necessary knowledge about it while leaving out from this book the details of the calculation and letting the statistical analysis software package programs handle it. Besides the degrees of freedom, there is an additional parameter to characterize a t-distribution, but it has not been dealt with yet and will be introduced in Chap. 8 because the involvement of the non-centrality parameter is unnecessary until the power and sample size topic appears in Chap. 8. Until then, all t-distributions being dealt with are assumed to have their non-centrality parameter of 0 (central t-distribution).
2.2.4 Hypothesis Testing
As mentioned earlier in this chapter, the two pillars of statistical inference are hypothesis testing and estimation.
2.2.4.1 Frame of Hypothesis Testing
The main objective of a scientific investigation is to demonstrate that a new discovery differs from (improves upon) what has been discovered in the past. Scientific investigations usually involve formal procedures consisting of articulating a research hypothesis about the anticipated new finding, designing a study, conducting the study (i.e., experiment or observation) and gathering data, and performing data analysis (i.e., making statistical inference) to reveal data evidence that is beyond reasonable doubt. The primary reason for performing data analysis is to conduct statistical inference. We will discuss hypothesis testing first and then discuss estimation (concentrating especially on interval estimation).
Hypothesis testing requires stated hypotheses (statements of the hypothesized facts) and a test rule: the researcher states a pair of mutually contradictory hypotheses, of which the first is to be ruled out if the data evidence against it is strong, in which case the second is favorably admitted. For example, "a new intervention A is as efficacious as the currently available intervention B" can be such a form of the first hypothesis of the pair, and "a new intervention A is more efficacious than the currently available intervention B" can be the second of the pair. Traditionally, the first hypothesis is denoted by H0 and the second by H1. This is like a challenger, H1, challenging H0 with the observed data evidence.
2.2 Statistical Inference
Having stated H0 and H1, the next step is to test whether the data evidence favors H0 or H1 based on a rigorous and objective statistical rule. The test result can either rule out H0, so that the study can finally pronounce that H1 prevails over H0, or vice versa. The logical framework of claiming H1 given the data evidence is not to let the data evidence prove the proposition H1 directly; rather, it is to rule out H0 if the observed data show significant counter-evidence against H0. On the other hand, if the counter-evidence against H0 is not significantly strong, then this logical framework lets the researcher return to H0 (i.e., the previous scientific finding remains the same, so go back to the drawing board). Using such a procedure, a researcher can set H1 as a new milestone if the data evidence was significant enough to reject H0 (i.e., an advancement over the previous finding). Thus, H0 is called the null hypothesis (i.e., back to null) and H1 is called the alternative hypothesis.
In this procedure, the test rule-based final decision is to reject the null hypothesis, H0, or to fail to reject it. Here, "fail to reject" H0 is not synonymous with "accept" H0: the observed data cannot be absolutely certain and perfect because observed sample data always involve uncertainty (i.e., sampling and non-sampling errors), and sample data can never prove either hypothesis. One may argue: why do we not carry out an inference involving only H1 as a single hypothesis and let the study data directly prove it? Justifications can be made in ways ranging from the purely mathematical to the very pragmatic. One pragmatic justification is that the H0 versus H1 approach always directs the researchers to the next study plan because it offers two options, either returning to H0 or proceeding with H1. The analogy is the logical framework of the courtroom trial: a defendant remains innocent (H0) if there is insufficient factual evidence and becomes guilty (H1) if there is sufficient factual evidence.
2.2.4.2 Step-by-Step Overview of Hypothesis Testing
This is a formula-free overview of hypothesis testing. For the sake of convenience, the flow can be broken down into five steps; note that such a breakdown is somewhat arbitrary.
Step 1: Stating null (H0) and alternative (H1) hypotheses. This step usually takes place at the beginning of the study (i.e., protocol development stage). The study investigator translates the research hypotheses into the statistical hypotheses and writes them in the statistical analysis plan section of the protocol.
Step 2: Establishing the test rule (decision rule to determine the significance of the observed data evidence to reject the null hypothesis). This step also usually takes place at the protocol development stage. The study investigator articulates the decision rule (i.e., method of a test) in the statistical analysis plan section of the protocol.
Step 3: Collecting data (i.e., conducting clinical study according to the written protocol) and data reduction (i.e., performing data analysis to obtain sample statistics).
Step 4: Data analysis (i.e., apply the test rule established in Step 2 to the collected data).
Step 5: Making interpretation and report writing.
2.2.4.3 Stating Null and Alternative Hypotheses
The following four pairs of null and alternative hypotheses are typical setups of the hypotheses for one mean inference (Table 2.6).
2.2.4.4 How to Phrase the Statistical Hypotheses
A hypothesis is a statement about a fact in the population (i.e., not a statement about the sample). The standard format of a written statistical hypothesis is that it is in the present tense and does not contain the language "statistically significant." Table 2.7 exhibits some examples of improper phrases.
2.2.4.5 Significance of the Test
Having stated the null and alternative hypotheses, the researcher collects data that would or would not disprove the null hypothesis. While there are many important
Table 2.6 Null and alternative hypotheses
Null hypothesis                     Alternative hypothesis                   Simple or composite^a                  Directionality of composite alternative
H0: μ = μ0 (mean is equal to μ0)    H1: μ = μ1 (mean is equal to μ1)         Simple null / simple alternative       N/A
H0: μ = μ0 (mean is equal to μ0)    H1: μ ≠ μ0 (mean is not equal to μ0)     Simple null / composite alternative    Nondirectional (two-sided)
H0: μ = μ0 (mean is equal to μ0)    H1: μ > μ0 (mean is greater than μ0)     Simple null / composite alternative    Directional (one-sided)
H0: μ = μ0 (mean is equal to μ0)    H1: μ < μ0 (mean is smaller than μ0)     Simple null / composite alternative    Directional (one-sided)

Note: In some tests, the null hypothesis can be directional and composite. For example, H0: μ ≤ μ0 vs. H1: μ > μ0 is such a case. For certain tests, H0: μ = μ0 vs. H1: μ > μ0 is used interchangeably without loss of generality. The readers may consult an intermediate or advanced theory text for more discussion.
^a A simple hypothesis involves a single value of the parameter; a composite hypothesis involves more than one value (e.g., an interval) of the parameter.
Table 2.7 Examples of improper phrases for statistical hypotheses
Improper phrases: "H0: The sample mean is not different from 150; H1: The sample mean is different from 150"
Reason: Hypotheses are statements about the population, not about the sample.

Improper phrases: "H0: The mean is not statistically significantly different from 150; H1: The mean is statistically significantly different from 150"
Reason: The sentence must not include the wording "statistically significantly."

Improper phrases: "H0: The mean will not be different from 150; H1: The mean will be different from 150"
Reason: Although research hypotheses are written in the future tense, statistical hypotheses should be written in the present tense because they are statements about the fact of the population of interest.
constituents of the test, introducing them all at once may create confusion. This section discusses the testing procedure as untechnically as possible.
A level of significance (or significance level in short), α, is chosen by the researcher before the data analysis (i.e., it is not a result value from the observed data), and it determines the stringency of the test rule for rejecting the null hypothesis. This level is the maximum allowed probability that the test rejects the null hypothesis erroneously even though it should not be rejected. Such a decision error is called a Type 1 error. The level of significance, α, is also called the test size. A common choice is α = 5%.
A test statistic is a sample statistic (see Sects. 1.3.1 and 2.2.1 for its definition) to gauge the strength of evidence opposing H0. Its value is calculated from the observed sample data and it varies from sample to sample (i.e., study to study). The phenomenon of this sample-to-sample variation of the test statistic is measured by the standard error of the test statistic (i.e., standard deviation of the sampling distribution of the test statistic). The strength of this evidence opposing H0 is assessed by the relative extremity (i.e., how unlikely that the calculated test statistic value is observed) of the test statistic according to its sampling distribution (see Sect. 2.1.2 for its definition). The typical (with few exceptions) formulation of the test statistic is constructed by Observed Estimate ~ Null Value ~ Standard Error (SE) trio shown below:
Test Statistic = (Observed Estimate - Null Value) / SE(Observed Estimate - Null Value) = Signal / Noise.
The numerator, Observed Estimate - Null Value, is the deviation of the sample estimate from the hypothesized parameter value specified in the null hypothesis. In this expression, for instance, for a test in which H0: mean μ = μ0 versus H1: mean μ ≠ μ0, the Observed Estimate is the sample mean x̄ and the Null Value is the parameter value μ0 specified in the null hypothesis. The SE(.) is the standard error of the numerator, which is the sampling error of the observed difference of (Observed
Estimate – Null Value). The test statistic is the ratio of the former to the latter. The denominator of this expression can be further simplified in that since the Null Value is a hypothesized value (i.e., constant with no variation), the sampling variability of the Observed Estimate – Null Value will be the same as that of only the Observed Estimate. Thus, SE (Observed Estimate - Null Value) = SE (Observed Estimate), and:
Test Statistic = (Observed Estimate - Null Value) / SE(Observed Estimate).
A test statistic value of 0 indicates that the observed mean estimate is equal to the null value. A value of 1 indicates that the observed estimate's deviation from the null value is as large as its average random sampling error (such a value does not indicate a significant deviation from H0), and -1 indicates the same degree of deviation but in the opposite direction (such a value does not indicate a significant departure from H0 either). A value of 3 indicates a deviation three-fold greater than the average sampling error, and such a large value may indicate a significant deviation from H0 (Fig. 2.7).
Note that this formulation unifies most of the common test statistics that are used for many kinds of comparison situations. This formulation can easily be extended to comparing a difference in two means to a specified null value of 0 (see Sect. 3.1).
Usually, the name of a test comes from the name of the sampling distribution of the test statistic. For example, the t-test is a test in which the test statistic follows the t-distribution with a unique degrees of freedom (it will be explained later that the
Fig. 2.7 Illustration of what a test statistic measures
degrees of freedom of the t-distribution are uniquely determined by the sample size, see Sect. 2.2.4.5).
A p-value is the probability of observing test statistic values that are as or more extreme than the currently calculated test statistic value if the null hypothesis H0 were true. This probability is calculated by evaluating the tail area under the density curve of the sampling distribution of the test statistic (see Sects. 1.5.2, 1.5.3, and 1.5.4 for the area under a density curve). If the p-value is less than the significance level, α, then the observed data evidence is interpreted as significant, suggesting that the data have not been gathered from the population specified by the null hypothesis H0, and that the probability that such a deviation could have arisen from chance alone is less than the adopted significance level, α.
Technically, the evaluation of the tail area under the density curve would be a daunting numerical integration without a computer. However, resorting to the idea of a critical region can replace such a numerical integration. A critical region of a test statistic is a collection of all possible test statistic values (i.e., an interval or multiple intervals on the sampling distribution) whose total probability of occurrence is less than the significance level when the null hypothesis H0 is true. The critical region of a test is primarily determined by the adopted significance level (i.e., the critical region becomes narrower as the adopted significance level becomes more stringent). H0 is then rejected at the adopted significance level if the observed test statistic value falls into this region. Note that the critical region is also called the rejection region. Checking whether the test statistic resides inside or outside the rejection region can be done using a statistical table of the sampling distribution of the test statistic. The statistical table is a collection of intervals on the possible range of the test statistic and their corresponding probabilities of occurrence (use of the tables is introduced in Sect. 2.2.4.6). Note that if the test statistic value is equal to the critical value, then the p-value is equal to the adopted significance level; if it falls inside (outside) the critical region, then the p-value is less (greater) than the significance level.
2.2.4.6 One-Sample t-Test
This section introduces one-sample t-test for the inference about a single mean of a population. The following example is used for introducing the procedure, particularly why this test is called t-test, and how the aforementioned constituents covered in Sect. 2.2.4.5 are put into operation in the previously mentioned five steps.
Example 2.1 This is an example of a laboratory-based animal study of a synthetic hormone that is delivered via a dermal gel cream applied directly to the thigh muscle of a mouse. The investigator hypothesized that the hormone level circulating in mouse blood measured at 1 hour after proper application is about 15% of the total volume contained in the prescribed gel dose. Furthermore, the circulating hormonal volumes are known to follow a normal distribution. The investigator plans that if the current experiment shows the mean circulating volume is at least 15%, then the gel's
hormone concentration will not be increased, otherwise a new experiment with an increased concentration rate will be conducted.
Step 1 – Stating null (H0) and alternative (H1) hypotheses:
The first step for the hypothesis test inference is to state the null and alternative hypotheses. The investigational objective can be translated as below. Null hypothesis: the mean is 15 (i.e., H0: μ = 15); alternative hypothesis: the mean is less than 15 (i.e., H1: μ < 15).
Step 2 – Establishing the test rule (decision rule to determine the significance of the observed data evidence to reject the null hypothesis):
A 5% significance level, i.e., α = 0.05, is adopted. The laboratory scientists would reject the null hypothesis if p < 0.05, or equivalently if the observed test statistic falls into the critical region of the test statistic's sampling distribution (i.e., the test statistic falls outside the interval determined by the 5% alpha level). The test statistic will gauge the ratio (Observed mean - Null mean)/(Standard error of the numerator).
Step 3 – Collecting data and data reduction:
The investigator randomly selected 10 mice with the same body weight and then applied exactly the same dose of the gel to each experimental animal. The following data are the circulating volumes measured in percentage value of the delivered gel volume:
14.45, 14.40, 14.25, 14.27, 14.57, 14.99, 12.97, 15.29, 15.07, 14.67.
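Although the book carries out the upcoming computations in Excel, the same arithmetic can be sketched with Python's standard library (Python is an assumed tool here; the cutoff -1.8331 is the 5th percentile of the t-distribution with df = 9, taken from a t-table):

```python
import math
import statistics

volumes = [14.45, 14.40, 14.25, 14.27, 14.57, 14.99, 12.97, 15.29, 15.07, 14.67]
n = len(volumes)
null_mean = 15.0

mean = statistics.fmean(volumes)        # observed estimate: 14.493
sd = statistics.stdev(volumes)          # sample SD (n - 1 in the denominator)
se = sd / math.sqrt(n)                  # standard error of the sample mean
t = (mean - null_mean) / se             # (Observed Estimate - Null Value) / SE

# One-sided test at alpha = 0.05: reject H0 when t falls below the 5th
# percentile of t(df = 9), i.e., -1.8331 (taken from a t-table).
reject_h0 = t <= -1.8331
print(round(mean, 3), round(t, 3), reject_h0)   # 14.493 -2.503 True
```

These are exactly the quantities assembled step by step in Step 4 below.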
Step 4 – Data analysis (i.e., apply the test rule established in Step 2 to the collected data):
The objective of data analysis for the hypothesis test is to calculate the test statistic (i.e., data reduction) and make a decision to reject or not to reject H0 based on the rule. The rule had been outlined briefly in Step 2; with the observed data, the rule can now be completed. The first task is to formulate the test statistic and calculate its value. The second task is to evaluate its significance, either by directly calculating the p-value or by determining whether it falls inside or outside the critical region of the test statistic. The third task is to make a decision. The test statistic is constructed as (observed mean - null value)/SE (of the numerator). The observed mean of the 10 observations is 14.493, and its null value is 15. Note that the null value is the parameter value that is expected if the null hypothesis is true. The numerator part, which measures the signal of the observed data's departure from the null value, is 14.493 - 15 = -0.507 (about half a percentage point below 15%). The denominator, which is the random sampling noise (i.e., the standard error of the signal), should be sought. Note that the null value is a fixed constant, and it does not add sampling variation to the signal Observed Estimate - Null Value. Therefore, the standard error of the whole numerator remains the same as that of the observed sample mean alone. The standard error of the observed sample mean can be calculated by dividing the sample standard deviation
by the square root of the sample size (see Sect. 2.1.3). The sample standard deviation is 0.640, thus the standard error is 0.640/√10 = 0.203. Finally, the test statistic value is obtained, i.e., -0.507/0.203 = -2.503. Let us take a close look at the test statistic. The mathematical expression of this test statistic is congruent to the quantity t = (x̄ - μ)/(s/√n) = √n (x̄ - μ)/s, which follows a t-distribution, introduced in Sect. 2.2.3, where the numerator, x̄ - μ, is to gauge how much the observed sample mean deviates from the hypothesized null value, and the denominator, s/√n, is to gauge the sampling variability of the numerator. The test statistic will follow a t-distribution with df = n -
1, if the raw data were drawn from a normally distributed population. Since circulating hormonal volumes are known to follow a normal distribution, this test statistic calculated from these 10 observations will follow the t-distribution with df = 10 - 1 = 9. The naming convention is to name a test after the sampling distribution of its test statistic; knowing that the sampling distribution of this test statistic is a t-distribution, we call this a t-test. It is worth mentioning at least briefly that test statistics can be derived from many situations that are not exactly the same as the above inference (i.e., testing if a single mean is equal to a specific value). For example, a test can be devised to compare two means. As we will see in later chapters (e.g., independent samples t-test, paired sample t-test, etc.), test statistics derived from various other situations will also follow a t-distribution. To uniquely identify the t-test applied to this single-mean illustrative example, we specifically call it the one-sample t-test.

Lastly, we can calculate the p-value determined by the test statistic value t = -2.503 based on the t-distribution with df = 9 (see Fig. 2.8). This calculation can be carried out with Excel's TDIST function, which returns the area under the t-distribution's density curve in the upper tail or in both tails, as specified by the user. The specification TDIST(|t|, df, 1 or 2) takes the absolute value of the t-statistic, the df, and the choice of 1 or 2 tails, in this order. For this exercise, TDIST(|-2.503|, 9, 1) is needed. Note that the first input should be 2.503 (i.e., without the negative sign) because the program utilizes the symmetry of t-distributions; the actual code is TDIST(2.503, 9, 1). Figure 2.9 illustrates how the raw data are organized, how the sample mean, its standard error, and the t-statistic are calculated, and how the p-value is obtained. The calculated p-value = 0.017 is much less than the significance level 0.05. Therefore, H0 is rejected.

Fig. 2.8 Determination of p-value of a test statistic in a one-sample directional t-test

Fig. 2.9 Numerical illustration of calculating test statistic in a one-sample t-test

Fig. 2.10 Determination of the critical region of a one-sample directional t-test
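Excel's TDIST is one way to get this tail area; it can also be reproduced by numerically integrating the t density (a sketch in Python, an assumed tool here; the trapezoid rule over a wide but finite range stands in for the exact tail integral):

```python
import math

def t_pdf(x, df):
    # Density of the (central) t-distribution with df degrees of freedom
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1.0 + x * x / df) ** (-(df + 1) / 2)

def upper_tail_area(t0, df, hi=60.0, steps=100000):
    # Trapezoid-rule integral of the density from t0 to hi; the tail beyond
    # hi is negligible for moderate df. This mimics TDIST(t0, df, 1).
    h = (hi - t0) / steps
    area = 0.5 * (t_pdf(t0, df) + t_pdf(hi, df))
    for i in range(1, steps):
        area += t_pdf(t0 + i * h, df)
    return area * h

# By symmetry, the lower-tail p-value at t = -2.503 equals the upper-tail
# area at 2.503.
p = upper_tail_area(2.503, 9)
print(round(p, 3))   # about 0.017
```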
We can also take the approach of checking whether t = -2.503 falls inside or outside the critical region. Figure 2.10 illustrates how the critical region of this one-sided one-sample t-test is determined at a 5% significance level using a t-distribution table (Table 10.2) and visualizes it on the density curve of the sampling distribution of the test statistic t (df = 9). This critical region is where the test statistic is less than the fifth percentile of the t (df = 9) distribution. The table in Fig. 2.10 shows the 95th percentile rather than the fifth percentile because the density curve of a t-distribution is symmetrical. The 95th percentile is 1.8331, thus the fifth percentile is -1.8331. The critical region {t ≤ -1.8331} is also graphically depicted in the figure. The observed t = -2.503 is less than -1.8331 and falls into this critical region. Thus, H0 is rejected.
Step 5 – Making interpretation and report writing:
From the above steps, this test can be summarized as “These data showed that the volume of the circulating hormone is less than 15% of the prescribed total volume at a 5% significance level.”
2.2.4.7 Comments on Statistically Significant Test Results
In a clinical study, statistically significant evidence with a tiny signal size may not necessarily constitute clinical significance. Such a result can be observed in a study with a very large sample size (i.e., unnecessarily larger than the adequate sample size). On the other hand, clinically significant evidence can be statistically insignificant due to an inadequately small sample size or very large data variability. Such a result could have been significant if a larger sample size or better error control (e.g., a better study design) had been employed. See Chap. 6 for more discussion.
Reporting format in the results section of clinical research journals is also important. Table 2.8 shows a few examples of recommended and not recommended formats.
2.2.4.8 Types of Errors in Hypothesis Tests
Hypothesis tests cannot be completely free of decision errors. The first type of error is rejecting H0 although it is true, and the second type is failing to reject H0 although it is not true. The first kind is called a Type 1 error, and the second kind a Type 2 error.
The adopted significance level of a test, α, predetermined by the investigator, is the maximum allowed probability of a Type 1 error. The maximum allowed probability of a Type 2 error is called β, and 1 - β is called the power of the test. Figure 2.11 illustrates the probabilities of Type 1 and Type 2 errors of a test in which the null and alternative hypotheses are both simple hypotheses (i.e., single-valued hypotheses, so the test is directional). As depicted, the Type 1 error probability is the area under the density curve of the sampling distribution of the test statistic within the rejection region under the assumption that H0 is true. The Type 2 error probability is the area under the density curve of the sampling distribution of the test statistic within the non-rejection region under the assumption that H0 is not true (i.e., H1 is true). Chapter 6 will address the relationship between the sizes of Type 1 and Type 2 errors and the sample size, and will discuss how to determine an adequate study sample size that keeps both error sizes small (Table 2.9).
Note that the following fire alarm system metaphor helps the understanding of Type 1 and Type 2 errors, level of significance, and power (Table 2.10).
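The meaning of the test size can also be seen by simulation: if H0 is in fact true and the study is repeated many times, about α = 5% of those studies will (falsely) reject H0. A Monte Carlo sketch (hypothetical simulation settings; Python's standard library is assumed):

```python
import math
import random
import statistics

random.seed(1)

def one_sided_t_rejects(sample, null_mean, crit=-1.8331):
    # Reject H0: mu = null_mean in favor of H1: mu < null_mean when the
    # t statistic falls below the 5th percentile of t(df = n - 1);
    # -1.8331 is the t-table value for df = 9 (n = 10).
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    t = (statistics.fmean(sample) - null_mean) / se
    return t <= crit

# Simulate studies in which H0 is actually true (mu = 15): in the long run,
# the fraction of (false) rejections should approximate alpha = 0.05.
trials = 5000
false_rejections = sum(
    one_sided_t_rejects([random.gauss(15.0, 0.64) for _ in range(10)], 15.0)
    for _ in range(trials)
)
print(false_rejections / trials)   # close to 0.05
```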
Table 2.8 Examples of recommended and not recommended formats of summary sentences to appear in the results sections of clinical research journal articles
Recommended summary sentences:
"These data showed that the mean is significantly different from 15 min (p = 0.031)"
Comment: Show the actual p-value.
"These data showed that the mean is significantly different from 15 min (p < 0.05)"
Comment: Simply report that the p-value was smaller than the significance level.
"These data showed that the mean is not significantly different from 15 min at a 5% significance level"
Comment: A nonsignificant test result at the level of significance = 0.05.
"These data showed that the mean is not significantly different from 15 minutes (NS at 5% significance level)"
Comment: A nonsignificant test result at the level of significance = 0.05.

Not recommended summary sentences:
"The null hypothesis is rejected because p < 0.05. These data showed that the mean is significantly different from 15 minutes (p = 0.031)"
Comment: Do not say in the report "... the null hypothesis was rejected because p < 0.05 ..."; storytelling of the technical details of the procedure is unnecessary.
"These data showed that the mean is not significantly different from 15 min (p > 0.05)"
Comment: Do not report meaningless p-values in concluding sentences when the p-value is greater than your significance level. However, in tables summarizing multiple results together, such a format is allowed.
"These data showed that the mean is not significantly different from 15 min (p = 0.278)"
Comment: Do not report meaningless p-values in concluding sentences when the p-value is greater than your significance level. However, in tables summarizing multiple results together, such a format is allowed.
2.2.5 Accuracy and Precision
The concepts of accuracy and precision are illustrated in Fig. 2.12. Archery Player 1 performed more accurately but less precisely than Player 2. Player 3 performed more accurately and precisely than the other two players. This illustration can be viewed as three researchers (Researchers 1, 2, and 3 corresponding to Players 1, 2, and 3, respectively) having repeated 10 random samplings, each with a fixed sample size (n1 by Researcher 1, n2 by Researcher 2, and n3 by Researcher 3), to obtain each point estimate (represented by each "x" mark) using each sample data set and the same computational method. Intuitively, Researcher 1 might have chosen a smaller sample size than Researcher 2 but a better sampling method that prevented the bias present in Researcher 2's estimates. How could Researchers 1 and 2 improve their results to obtain a result like Researcher 3's? Researcher 1 would increase the sample size without
Fig. 2.11 Determination of sizes of Type 1 and Type 2 errors in a one-sample t-test
Table 2.9 Errors in decisions and their probabilities in a hypothesis test
                H0 is true                                     H1 is true
Accept H0       Correct decision (the probability of this      Type 2 error (the probability of
                correct decision is called the operating       committing a Type 2 error is β)
                characteristic)
Reject H0       Type 1 error (the maximum allowed              Correct decision (the probability
                probability of committing a Type 1 error       of this is 1 - β, the power)
                in a test is the α level, e.g., 5%)
Table 2.10 Alarm system metaphor of testing hypotheses
Hypothesis test              Alarm system metaphor
Type 1 error                 Alarm turns on even if there is no fire breakout
Type 2 error                 Alarm does not turn on even if there is a fire breakout
Level of significance, α     The level of false alarm sensitivity of the system
Power, 1 - β                 Performance level of the alarm system (i.e., it turns on whenever it should) after the sensitivity level is set
having to reconsider the sampling technique, and Researcher 2 would examine possible sources of systematic non-sampling error and eliminate such causes in the future sampling procedure without having to increase the sample size. Connection of study design and sample size to the accuracy and precision is discussed in Sect. 2.2.8.
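The remedy suggested for Researcher 1, increasing the sample size, works because the spread of repeated point estimates shrinks roughly as 1/√n. A small simulation sketch (hypothetical settings; Python's standard library is assumed):

```python
import math
import random
import statistics

random.seed(7)

def replicate_sample_means(n, reps=500, mu=0.0, sigma=1.0):
    # Repeat the sampling experiment `reps` times with a fixed sample size n
    # and collect the point estimates (the sample means).
    return [statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
            for _ in range(reps)]

spread_n5 = statistics.stdev(replicate_sample_means(5))    # about 1/sqrt(5)
spread_n50 = statistics.stdev(replicate_sample_means(50))  # about 1/sqrt(50)
print(spread_n5, spread_n50)
```

The replicate means at n = 50 cluster much more tightly around μ than those at n = 5, which is exactly the gain in precision the text describes.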
Fig. 2.12 Illustration of accuracy and precision
2.2.6 Interval Estimation and Confidence Interval
2.2.6.1 Overview
In Sect. 2.2.4, hypothesis testing was applied to a single mean inference. Having rejected H0 and pronounced that the population mean is significantly different from the null value at the adopted significance level (α), it may be of further interest to find a range of values that would not exclude the unknown population mean, so that the interval can be considered a collection of plausible values of the population mean at a certain level of confidence. On the other hand, if a test had not been able to reject H0, could a range be found that includes the unknown parameter value? Interval estimation is an avenue for making such an inference, and it is linked to precision.
A popular approach is to make an interval estimation by constructing a confidence interval. This approach relies on the theory of sampling distribution of the estimator (e.g., sample mean). For instance, the sample mean is a point estimate of the unknown population mean of interest (see Sect. 2.2.1 for the definition of an estimate), and the sample standard error measures the sampling variability of the estimated sample mean. With these, an interval around the point estimate (i.e., the sample mean in the case of mean inference) can be constructed based on the sampling distribution of the sample mean. Section 2.2.6.2 demonstrates a rendering
Fig. 2.13 Illustration of the concept of normal distribution–based 95% confidence interval of a mean
idea of the 95% confidence interval of the mean, and the derivation is illustrated for a Gaussian sampling distribution with either known or unknown standard deviation of the distribution of the population characteristic of interest. The derived lower and upper limits of the interval are indeed the 2.5th and 97.5th percentiles, respectively, of the sampling distribution of the sample mean, and the standard deviation of this sampling distribution is the standard error of the sample mean. Such a derived interval is one of many intervals obtainable from many possible repeated random sampling experiments with the same sample size (see Sect. 2.1.2). Over a large number of such experiments, about 95% of the individual intervals will contain the unknown population mean (see Fig. 2.13). Such an interval is called a 95% confidence interval of the mean.
2.2.6.2 Confidence Interval for a Mean
If a random sample of size n, (x1, x2, . . ., xn), is taken from a population in which the probability distribution of the population characteristic of interest is a Gaussian distribution with mean μ and standard deviation σ, then the sampling distribution of the sample mean x̄ will follow the Gaussian distribution with mean μ and standard deviation σ/√n. Equivalently, the standardized quantity z = (x̄ - μ)/(σ/√n) will follow the standard normal distribution (i.e., the Gaussian distribution with mean 0 and standard deviation 1).
One can find out an interval that covers the middle 95% of the observable sample means based on the standard normal distribution:
Probability{2.5th percentile ≤ (x̄ - μ)/(σ/√n) ≤ 97.5th percentile} = Probability{-1.96 ≤ (x̄ - μ)/(σ/√n) ≤ 1.96} = 0.95. Then solving -1.96 ≤ (x̄ - μ)/(σ/√n) ≤ 1.96 for
μ will offer the expression to obtain the 95% confidence interval of the population mean: x̄ - 1.96 × σ/√n ≤ μ ≤ x̄ + 1.96 × σ/√n. If σ is unknown (it is usually unknown in clinical studies), then utilize the sample standard deviation s in place of σ, i.e., x̄ - 1.96 × s/√n ≤ μ ≤ x̄ + 1.96 × s/√n.
Example 2.2 A small pilot study of healthy women's systolic blood pressure was conducted. The sample mean and standard deviation estimated from a random sample of 10 women are 115 and 10.31, respectively. What is the 95% confidence interval of the mean systolic blood pressure of this healthy women population?

Here x̄ = 115, and the population standard deviation is unknown; thus the sample standard deviation will be used for calculating the standard error, i.e., s/√n = 10.31/√10 = 3.26.
95% CI = (115 - 1.96 × 3.26, 115 + 1.96 × 3.26) = (115 - 6.39, 115 + 6.39) = (108.61, 121.39)
Result summary: These data showed that the estimated mean systolic blood pressure was 115 mmHg (95% confidence interval–CI: 108.61–121.39). Note that this sentence format is the one generally recommended in applied research articles.
With a sample size of only 10, it is questionable to assume that the sampling distribution of the sample mean with unknown population standard deviation follows a Gaussian distribution with mean 115 mmHg and standard error 3.26 mmHg. Alternatively, a t-distribution can be applied, in which the 2.5th and 97.5th percentiles of the t-distribution with df = n - 1 = 10 - 1 = 9 are -2.262 and 2.262, respectively (see Sect. 2.2.3 for df). The resulting 95% confidence interval using the t-distribution with df = 9 will be slightly wider (because |2.262| > |1.96|).
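As a quick check of the arithmetic in Example 2.2, the sketch below recomputes the normal-based interval with Python's standard library; the t critical value 2.262 (97.5th percentile of t with df = 9) is taken from a t table, since the standard library has no t-distribution quantile function.

```python
from statistics import NormalDist

n, xbar, s = 10, 115, 10.31
se = s / n ** 0.5                           # standard error = 10.31 / sqrt(10) ≈ 3.26

z = NormalDist().inv_cdf(0.975)             # ≈ 1.96
ci_normal = (xbar - z * se, xbar + z * se)  # ≈ (108.61, 121.39)

t9 = 2.262                                  # 97.5th percentile of t, df = 9 (t table)
ci_t = (xbar - t9 * se, xbar + t9 * se)     # slightly wider than the normal-based CI
```

The t-based interval is wider at both ends, reflecting the extra uncertainty from estimating σ with s in a sample of only 10.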
Let us discuss how to report the result of interval estimation and why a certain phrasing is recommended. Once a confidence interval (CI) has been constructed, the true parameter (e.g., the mean) is either within or outside this interval (i.e., the parameter is not a moving target but a fixed constant, and the constructed interval is a varying interval depending on the sampled data). The probabilistic argument makes sense only prior to the construction of the CI. If many equal-sized random samplings are conducted independently from a normally distributed population, then the sample means of these random sample sets will form a normal distribution (i.e., the sampling distribution). If we obtain, from each sample data set, an interval whose lower limit is sample mean – 1.96 × SE and the upper limit is sample mean + 1.96 × SE (i.e., a 95% CI based on the normal approximation), then each interval can either include or exclude the true population mean. What is being probabilistic is that about 95% of these anticipated intervals will include the true mean. And this argument will not make sense once a researcher’s sample data already produced a confidence interval. What has already happened is either this interval did or did not include the unknown population value. So, it is not stated that “We are 95% confident ...” but
rather stated by simply presenting the point estimate accompanied by the numerical interval: "These data showed that the estimated mean SBP was 115 mmHg (95% CI: 108.61–121.39)".
2.2.6.3 Confidence Interval for a Proportion
It is often of interest to make inferences about a single population proportion. If we are interested in knowing the proportion of the men and women in a large metropolitan population who were vaccinated with the influenza vaccine in the past 6 months, we may take a certain size of random sample of men and women and obtain the information about the vaccination by a survey. The question can be translated into "What is the probability that a man or woman in this population received an influenza vaccination in the past 6 months?" As explored earlier (see Sect. 1.5.5), this question can be well articulated by the Bernoulli distribution, in that a single person's survey answer (i.e., coded 1 if "Yes" or 0 if "No") is the event outcome and π is the population vaccination rate. We can assume that the answers of the sampled men and women are independent. The actual inference proceeds as follows. With the sample of n men and women, we first count the number of "Yes" answers, which is simply the sum of all the 1s and 0s (i.e., 1 + 0 + 1 + 1 + 0 + 1 + 0 + 0 + . . .), and then figure out the sampling distribution of that sum (let x denote this value). This sampling distribution is Bi(n, π), and the inference is all about π given n and x. The inference about π is to find its point estimate, perform a one-sample test of whether π is equal to a certain proportion (e.g., H0: π = 0.3 versus H1: π ≠ 0.3), and/or find its 95% confidence interval. Straightforwardly, the point estimate of π can be obtained by x/n. The one-sample test and the interval estimation can resort to the normal approximation (see Sect. 1.2.2) if n is large, or directly to the sampling distribution, i.e., the binomial distribution.
For example, if the survey sample size was 1000 and there were 270 persons who answered "Yes," then π̂ = 270/1000 = 0.27; the test statistic for a one-sample normal approximation test (i.e., z-test) with H0: π = 0.3 versus H1: π < 0.3 is (0.27 - 0.3)/√[0.27 × (1 - 0.27)/1000] = -0.03/0.014 = -2.137, whose one-sided p-value is 0.0163 (NORMSDIST(-2.137) by Excel), and the lower and upper 95% confidence limits are 0.27 - 1.96 × 0.014 = 0.242 and 0.27 + 1.96 × 0.014 = 0.298, respectively (i.e., 95% CI of the vaccination rate: 24.2%–29.8%).
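The same z-test and normal-approximation interval can be reproduced with the standard library; this is a sketch of the specific calculation above, not a general-purpose routine.

```python
from math import sqrt
from statistics import NormalDist

n, x, pi0 = 1000, 270, 0.30
p_hat = x / n                                  # point estimate: 0.27
se = sqrt(p_hat * (1 - p_hat) / n)             # ≈ 0.014

z = (p_hat - pi0) / se                         # ≈ -2.137
p_one_sided = NormalDist().cdf(z)              # ≈ 0.0163, for H1: pi < 0.3

ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)    # ≈ (0.242, 0.298)
```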
What if the sample size is small and the normal approximation cannot be relied upon? If n1 successes were observed out of n Bernoulli trials, i.e., π̂ = n1/n, then the Clopper–Pearson exact 100 × (1 - α)% lower and upper confidence limits, πL and πU, respectively, can be found by equating the probability that there would have been n1 or more successes under the lower bound πL to α/2, and also equating the probability that there would have been n1 or fewer successes under the upper bound πU to α/2 (i.e., letting the two tail probabilities sum to α):
Σ_{x=n1}^{n} C(n, x) πL^x (1 - πL)^(n-x) = α/2, and Σ_{x=0}^{n1} C(n, x) πU^x (1 - πU)^(n-x) = α/2.

Take an example with three observed successes out of n = 10 (i.e., π̂ = 0.3). The exact 95% confidence limits for the proportion π are found by solving

Σ_{x=3}^{10} C(10, x) πL^x (1 - πL)^(10-x) = 0.025 and Σ_{x=0}^{3} C(10, x) πU^x (1 - πU)^(10-x) = 0.025.

The first estimating equation finds the lower limit for which the upper-tail probability becomes 0.025, and the second equation finds the upper limit for which the lower-tail probability becomes 0.025 as well. The solved limits (calculated by a computer) πL and πU are 0.0667 and 0.6525, respectively. Note that this exact confidence interval (0.0667 ≤ π ≤ 0.6525) is not symmetrical with respect to π̂ = 0.3, whereas the normal approximation-based interval, (0.0160 ≤ π ≤ 0.5840), is symmetrical.
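A minimal sketch of solving the two Clopper–Pearson estimating equations numerically with only the standard library (bisection on the binomial tail probabilities; the function names here are my own, not from any particular package):

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(k + 1))

def bisect_root(f, lo=1e-9, hi=1 - 1e-9, tol=1e-12):
    """Root of a monotone function f on (lo, hi) by bisection."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

def clopper_pearson(n1, n, alpha=0.05):
    """Exact CI: solve P(X >= n1 | pi_L) = alpha/2 and P(X <= n1 | pi_U) = alpha/2."""
    lower = 0.0 if n1 == 0 else bisect_root(
        lambda p: (1 - binom_cdf(n1 - 1, n, p)) - alpha / 2)
    upper = 1.0 if n1 == n else bisect_root(
        lambda p: binom_cdf(n1, n, p) - alpha / 2)
    return lower, upper
```

For 3 successes out of 10, `clopper_pearson(3, 10)` returns limits close to the 0.0667 and 0.6525 quoted in the text.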
2.2.7 Bayesian Inference
The aforementioned inference in this chapter is called frequentist inference. Its paradigm is to let the parameter be a fixed constant and to perform either hypothesis testing (i.e., reject or do not reject the fixed null hypothesis by the chosen rule) or point and interval estimation without any prior probabilistic description of the parameter. Bayesian inference is a method of statistical inference whose paradigm differs from the frequentist in that it lets the parameter of interest be a random variable (i.e., a moving target) and lets the observed data determine the probability that a hypothesis is true. Hypothesis testing in this setting is informal, and the typical format of Bayesian inference is estimation. The term "Bayesian" comes from Thomas Bayes, whose rule of conditional probability (i.e., Bayes' rule) is described below.
As illustrated in Fig. 2.14, event A can be conceived as the collection of its part overlapping B and its part not overlapping B. The following intuitive algebra is useful for calculating the conditional probability of event B given that event A has already occurred:
Fig. 2.14 Two overlapping event sets A and B
P(B|A) = P(A and B)/P(A) –{1}
P(A|B) = P(A and B)/P(B) –{2}
From {2}, P(A|B) × P(B) = P(A and B) –{3}
Plugging {3} into {1}, then
P(B|A) = [P(B) × P(A|B)]/P(A) –{4}
Furthermore, if the sample space is partitioned into k disjoint events B1, B2, . . ., Bk, then P(A) can be written as P(B1) × P(A|B1) + P(B2) × P(A|B2) + . . . + P(Bk) × P(A|Bk). So, {4} can be written as P(Bi|A) = [P(Bi) × P(A|Bi)] / [P(B1) × P(A|B1) + P(B2) × P(A|B2) + . . . + P(Bk) × P(A|Bk)], for each i (i = 1, 2, . . ., k). A typical application to Bayesian inference is to consider B1, B2, . . ., Bk as the hypothesized parameter values and event A as the observed data. With such an application, we can calculate P(Bi|A), i.e., the posterior probability distribution of Bi given the observed data, from the prior distribution (i.e., P(B1), P(B2), . . ., P(Bk)) and the conditional probabilities of observing data set A given B1, B2, . . ., Bk (i.e., P(A|B1), P(A|B2), . . ., P(A|Bk)).
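The partitioned form of Bayes' rule in {4} is a one-line computation. Below is a sketch with purely hypothetical priors and likelihoods (the numbers are illustrative only, not from any study in this book):

```python
def posterior(priors, likelihoods):
    """P(Bi | A) from P(Bi) and P(A | Bi) over a partition B1..Bk."""
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(Bi) x P(A | Bi)
    p_a = sum(joint)                                      # P(A), total probability
    return [j / p_a for j in joint]

# Three hypothesized parameter values with prior beliefs and data likelihoods
post = posterior([0.5, 0.3, 0.2], [0.1, 0.2, 0.4])
```

Note how the hypothesis with the smallest prior (0.2) ends up with the largest posterior because the data were most likely under it.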
Technically, Bayesian inference picks a prior probability distribution (before observing the data) over the hypothesized parameter values, which are viewed as varying with chance (i.e., the parameter is a random variable). We then evaluate the so-called likelihoods of those hypothesized parameter values using the information contained in the observed data. Finally, we combine the likelihoods with the prior distribution to obtain the posterior probabilities of the hypothesized values. The ultimate decision is to pick the hypothesized parameter value that gained the greatest posterior probability and to identify the narrowest interval that covers 95% of the posterior distribution (i.e., a Bayesian 95% confidence interval).
The following example demonstrates how the Bayesian estimation is different from the frequentist estimation.
Example 2.3 Let’s consider a cross-sectional study that estimates the prevalence rate of osteoporosis in an elderly (age 65+ years) women population. The sample size was 500 and the observed number of subjects with osteoporosis diagnosis was 159.
The frequentist inference approach to this problem is to find the point estimate and its 95% confidence interval based on the normal approximation of the sampling distribution of the estimated proportion. The point estimate is 0.32 (i.e., 159/500). The upper and lower limits of its 95% CI are (159/500) ± 1.96 × √[(159/500)(1 - 159/500)/500], thus the 95% CI: 0.28–0.36.
The Bayesian approach to constructing a 95% confidence interval is different, as described in the steps below. Step 1. The investigator views the population prevalence rate as a random variable (i.e., the prevalence rate can vary with chance) that varies within the [0.01, 1.00) interval. This is articulated as: the "prior" (i.e., prior to data collection) distribution of the population prevalence rate is a uniform
Table 2.11 Distribution of posterior probabilities
distribution (i.e., the prior probability for each prevalence rate is equally likely within this interval). Step 2. Invoke Bayes' rule, i.e., the [posterior probability of population prevalence = π] is calculated as [prior probability of population prevalence = π] × [probability of observing 159 cases out of 500 randomly sampled subjects if π is the true value of the population prevalence]. According to the theory, the posterior probability is proportional to (not always equal to, because the actual likelihood evaluation may take place only within a selective subset of all possible parameter values) the product of these two quantities. Note that there are infinitely many values within [0.01, 1.00), and the researcher may get to evaluate only 99 equally spaced discrete values, e.g., 0.01, 0.02, . . ., 0.99; such a calculated probability at each discrete value π is called the likelihood. Step 3. The ultimate goal is to find the prevalence rate that gives the greatest posterior probability and to construct the narrowest interval that covers 95% of the posterior distribution.
The estimated π of 0.32 is the Bayesian estimate that had the greatest posterior probability. To construct a Bayesian confidence interval, note that the sum of column 5 in Table 2.11 is not 1 but 0.002, because the likelihood was evaluated at only 99 of all possible values, and, according to the theory, the posterior distribution is proportional to the calculated value of the last column. The last column can be rescaled so that the 99 posterior probability values form a probability distribution (i.e., so that they sum to 1). Because the sum of column 5 is 0.002, we divide every posterior probability value by 0.002.
The interval that is close to 95% around the Bayesian point estimate 0.32 can now be found (i.e., find where 95% of the posterior probability is distributed around the point estimate).
Table 2.12 Rescaled posterior probabilities of the prevalence near the 95% Bayesian confidence interval
The values of [individual posterior probability]/[sum of all individual posterior probabilities] = [individual posterior probability]/0.002 for the hypothesized prevalence values covering about 95% of the posterior distribution are listed in Table 2.12.
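The grid calculation in Steps 1–3 can be sketched directly: a uniform prior over the 99 grid points, a binomial likelihood for 159 cases out of 500, and a rescaled posterior. Log-space is used only to avoid numerical underflow; the variable names are my own.

```python
from math import lgamma, log, exp

n, k = 500, 159
grid = [i / 100 for i in range(1, 100)]        # 0.01, 0.02, ..., 0.99
prior = [1 / len(grid)] * len(grid)            # uniform prior over the grid

def log_binom_pmf(k, n, p):
    """log P(X = k) for X ~ Binomial(n, p)."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

log_like = [log_binom_pmf(k, n, p) for p in grid]
m = max(log_like)                              # subtract max before exp (stability)
unnorm = [pr * exp(ll - m) for pr, ll in zip(prior, log_like)]
total = sum(unnorm)
posterior = [u / total for u in unnorm]        # rescaled so the values sum to 1

map_p = grid[posterior.index(max(posterior))]  # grid value with highest posterior
```

With these data the maximum-posterior grid value is 0.32, matching the Bayesian point estimate in the text.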
The meaning of the frequentist confidence interval might not have been very clear (see Sect. 2.2.6.2), and an appreciation of the Bayesian confidence interval can be helpful. The Bayesian 95% CI is the interval over which 95% of the actual posterior probability is distributed; thus this CI is interpreted as "based on the observed data, we are 95% confident that the unknown, varying population parameter lies in this interval," whereas the frequentist confidence interval is not such an interval, as discussed earlier in this chapter.
2.2.8 Study Design and Its Impact on Accuracy and Precision
Good study designs will minimize sampling errors (i.e., increase precision) and non-sampling errors (i.e., increase accuracy and decrease bias) (Fig. 2.15).
2.2.8.1 Sampling and Non-sampling Error Control
Sampling error is the random error involved in the sample statistics (i.e., estimates of the population parameters of interest) arising from sampling (i.e., sample-to-sample random fluctuation), and it becomes very small when the sample size becomes very large. An extreme example is that if the observed sample is the whole population,
Fig. 2.15 Population, sample, sampling, and non-sampling errors
then there is no sampling error (see Sect. 2.1.1). How can a study design control (i.e., reduce) the sampling error? Increasing the sample size, as much as possible, will reduce the sampling error. Determination of the sample size is a very important part of the study design, and its ultimate goal is to keep the sampling error of the sample estimate within a tolerable size so that the investigator can find significant data evidence to answer the study question. However, it is unrealistic to make the sample size very close to the population size. Chapter 8 will discuss statistical sample size determination based on the statistical power of the hypothesis test.
Non-sampling error is the systematic error that causes the bias involved in the sample statistics due to a poor study design. Unlike the sampling error, it is not sample size dependent. How can the non-sampling error be controlled? A proper study design is necessary to identify the source of the bias causing non-sampling errors and prevent it before data collection. The next sections will briefly introduce the types of studies and popular approaches to study design.
2.2.8.2 Study Types and Related Study Designs
In clinical research settings, studies can be categorized into either observational or experimental study. In the observational study, the outcome-causing factors are not controlled by the study investigator. For example, in a study of gender difference in health-seeking behavior, the researchers cannot assign sex to the study subject because sex is not a condition that can be created by the researcher. On the other hand, in the experimental studies, the study investigator controls the outcome-
causing factor. For example, the dose levels of a dose response study of a certain medication are determined by the researcher.
2.2.8.3 Observational Study Designs
Case series design is applied in a small-scale medical investigation to describe patient information seen over a relatively short period of time. This is a purely descriptive study design, and it does not involve between-group comparisons and there are no control subjects. Because of such a descriptive nature, this design does not construct hypotheses. The result of a study by this design cannot be generalized.
Cross-sectional design is used to examine a selected sample of subjects from a defined population at one point in time (Fig. 2.16). Examples are a disease-diagnostic test performance study, an opinion survey in an election year, etc. Such a design cannot serve as a good design to investigate a causal determinant of the outcome because it takes time for an outcome to be manifested by its causal factor.
Cohort design is to monitor the cohorts of exposed (to a causal factor) and unexposed subjects prospectively over time and to examine the effect of the exposure on the long-term outcome by comparing between the exposed and unexposed sub-cohorts (Fig. 2.17). This design is useful for studies of relatively common outcomes. However, it requires extremely large cohort sizes to study rare outcomes and generally requires a long study period. The potential bias can be prevented relatively easily by this design because the inclusion/exclusion criteria for the cohorts are definable at the outset.
Case–control design is to overcome the limitations of the cohort design, particularly the long study period and the large cohort size required for rare outcomes. Unlike the cohort design, the case–control design samples cases and controls and looks back retrospectively to verify their risk factor exposure status in the past, by which a sufficient number of rare events can be ascertained through a large registry, etc. For example, a cohort design to study whether men with BRCA 1/2 mutations have an increased lifetime risk of breast cancer would require large cohorts
Fig. 2.16 Cross-sectional study design

Fig. 2.17 Cohort study design

Fig. 2.18 Case–control study design
of cancer-free men with and without the mutation, and the study must wait until it observes enough cancer incidence cases in both groups for a reliable comparison (Fig. 2.18). However, the case–control design can reduce the study burden by collecting enough male breast cancer patients from an already accessible large cancer registry established over many decades, along with the same number of healthy control men, then collecting their DNA and performing the comparative analysis. Although the case–control design is less burdensome than the cohort design, the chance of bringing in non-sampling errors is much greater than with the cohort design. For example, if the exposure is not measurable by an objective biological material such as a DNA mutation, then the collection of the past exposure history may induce recall bias (see Table 2.13). Long-term radiation exposure, smoking history, etc., are examples of such exposures.
2.2.8.4 Experimental Study Designs

Experimental designs can rule out potential sources of non-sampling errors. In a clinical research setting, the randomized study design is one such design. The concurrent
Table 2.13 Comparison of advantages (+) and disadvantages (-) between cohort and case–control designs

Cohort design | Case–control design
- Long study period | + Short study period
- Very costly | + Relatively inexpensive
- Suitable for relatively common disease | + Suitable for rare disease
+ Less selection bias in control group | - Selection bias in control group
+ No recall bias | - Recall bias in both case and control groups
- Direct monitoring of study volunteers is needed | + Medical chart review (only paper documents or computer records) is possible
- Attrition problem | + No attrition problems
+ Incidence rates (i.e., probability of an outcome within a certain time, e.g., annual cancer incidence rate) can be determined | - Cannot determine incidence rate
+ Relative risk is accurate | - Relative risk is approximate
Fig. 2.19 Concurrent (parallel) controlled randomized design
(parallel) randomized design (Fig. 2.19) is commonly applied in clinical trials to evaluate the efficacy of a new therapy by comparing the average results of the subjects who received the treatment with those of untreated subjects. Well-carried-out randomization will keep the distributions of demographic and clinical characteristics of the treated and untreated subjects balanced so that the estimated efficacy is not confounded by uneven demographic and clinical characteristics, i.e., a randomized study avoids potential bias causing non-sampling errors.
However, the randomized clinical design can be infeasible in some situations, such as when a placebo control group cannot serve as the comparison group (e.g., urgent medical problems that worsen if left untreated). The single-group self-controlled cross-over design examines whether the within-participant changes over Period 1 and Period 2 are equal (Fig. 2.20).
Fig. 2.20 Self-controlled (cross-over) design

Study Questions
1. What are the mean and standard deviation of the sampling distribution (i.e., standard error) created from the sample means of three numbers randomly drawn from 1, 2, 3, 4, and 5? Note that there will be 10 sample means.
2. What are the two pillars of the frequentist inference?

3. In the central limit theorem, what becomes a Gaussian distribution when the sample size becomes large?

4. Under what circumstance is a t-distribution resorted to (instead of the standard Gaussian distribution) for the inference of a mean?

5. Explain the idea of the Observed ~ Expected ~ Standard Error trio formulation of the test statistic t of a one-sample t-test for a single-mean inference.

6. In this chapter, what is the parameter being inferenced in a one-sample t-test?

7. What is the numerator of the test statistic in a one-sample t-test?

8. What is the denominator of the test statistic in a one-sample t-test?

9. Why is the t-test named as such?

10. Letting alone the statistical significance, how should these three values be interpreted (hint: signal to noise): t = 1.5 (df = 10); t = 2.5 (df = 10); t = 3.5 (df = 10)?

11. Is the 95% confidence interval around the mean the same as the 2.5th to 97.5th percentile range of a sample distribution?

12. Please criticize the following sentences:

"These data showed that the outcome was significantly different from 50 (p<0.05)." Is the word "outcome" appropriate?
“These data showed that the mean age was significantly different from 50 years.” Does this sentence present a p-value?
“These data showed that the sample mean age was significantly different from 50 years (p<0.05).” Was the inference made about the sample mean?
Bibliography
Beaumont, Geoffrey P. (1980). Intermediate Mathematical Statistics. Chapman and Hall.
Clopper, C.; Pearson, E. S. (1934). "The use of confidence or fiducial limits illustrated in the case of the binomial". Biometrika 26(4): 404–413.
Cochran, William G. (2007). Sampling Techniques (3rd Edition). Wiley India Pvt. Limited.
Gosset, William Sealy (1908). "The probable error of a mean". Biometrika 6(1): 1–25.
Hoel, Paul G. (1984). Introduction to Mathematical Statistics (5th Edition). John Wiley & Sons.
Hogg, Robert V.; Tanis, Elliot A. (2010). Probability and Statistical Inference (8th Edition). Prentice Hall, New Jersey.
Johnson, Norman L.; Kotz, Samuel; Balakrishnan, N. (1994a). Continuous Univariate Distributions, Vol. 1 (2nd Edition). John Wiley & Sons.
Johnson, Norman L.; Kotz, Samuel; Balakrishnan, N. (1994b). Continuous Univariate Distributions, Vol. 2 (2nd Edition). John Wiley, New York.
Lindgren, Bernard (1993). Statistical Theory (4th Edition). Chapman & Hall.
Mood, Alexander M.; Graybill, Franklin A.; Boes, Duane C. (1974). Introduction to the Theory of Statistics (3rd Edition). McGraw-Hill.
Morton, Richard F.; Hebel, John Richard; McCarter, Robert J. (1996). A Study Guide to Epidemiology and Biostatistics (4th Edition). Aspen Publishers.
Pagano, Marcello; Gauvreau, Kimberlee (1993). Principles of Biostatistics. Duxbury Press.
Snedecor, George W.; Cochran, William G. (1991). Statistical Methods (8th Edition). Wiley-Blackwell.
Williams, Bill (1978). A Sampler on Sampling. John Wiley & Sons.
Chapter 3
t-Tests for Two-Mean Comparison
In Chap. 2, the one-sample t-test was introduced to test whether a single mean of a population is equal to a certain value. This chapter will introduce the extension of the t-test to examine whether the difference between two population means is equal to a certain value. Two situations will be discussed: the first is when the two means are from independent (i.e., unrelated) populations, and the second is when the two means are from related populations.
3.1 Independent Samples t-Test for Comparing Two Independent Means
Example 3.1: Does Medication X Prevent Low-Birthweight Delivery? A prospective two-arm randomized controlled clinical trial (Arm 1: Medication X and Arm 2: Placebo control) was conducted. Because the study design was a randomized clinical trial, the two study groups are independent. The raw data and a result summary of a descriptive analysis are shown below. Birth weights (lb) of 15 newborns from mothers treated with medication X (denoted by Tx) and 15 from mothers given placebo are listed. Let us assume the data are normally distributed and the dispersions of the two distributions are not much different.
Tx (n = 15): 6.9 7.6 7.3 7.6 6.8 7.2 8.0 5.5 5.8 7.3 8.2 6.9 6.8 5.7 8.6 Placebo (n = 15): 6.4 6.7 5.4 8.2 5.3 6.6 5.8 5.7 6.2 7.1 7.0 6.9 5.6 4.2 6.8
(Fig. 3.1)
How can we tackle this problem? The approach to use an extension of the t-test will be introduced first, and then the applied result will be illustrated.
Below is the framework for testing the equality of two population means using two independently drawn random samples from normally distributed populations.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
H. Lee, Foundations of Applied Statistical Methods, https://doi.org/10.1007/978-3-031-42296-6_3
Fig. 3.1 Listed and visualized data from a two-arm randomized clinical trial
Fig. 3.2 Illustration of two means of normally distributed paired continuous outcomes and difference of the two means
The null hypothesis is that the two population means are equal (i.e., H0: μ1 = μ2). Note that "two population means are equal" can be translated into "the difference of the two population means = 0 (i.e., H0: μ2 - μ1 = 0)," by which the dimension of the argument is reduced to one (i.e., letting δ denote the single translated parameter, see Fig. 3.2, where δ = μ2 - μ1), so that the corresponding sample statistic to work with is the observed difference between the two sample means (i.e., letting d denote x̄2 - x̄1). Finally, the null and alternative hypotheses are expressed as H0: δ = 0 and H1: δ ≠ 0 for a nondirectional test (or H1: δ > 0 for a directional test examining μ2 > μ1) (Table 3.1).
The idea of Observed Estimate ~ Null Value ~ standard error (SE) trio (see Sect.
2.2.4.5) is applied to derive the test statistic:
t = (Observed Estimate - Null Value) / SE(Observed Estimate - Null Value) = (d - δ)/SE(d) = (d - 0)/SE(d) = d/SE(d).
Table 3.1 Null and alternative hypotheses for comparing two means

Null hypothesis | Alternative hypothesis | Simple or composite^a (null / alternative) | Directionality of composite alternative
H0: μ2 - μ1 = 0 (i.e., μ1 = μ2); means are equal | H1: μ2 - μ1 = δ; mean difference is equal to δ | Simple / Simple | N/A
H0: μ2 - μ1 = 0 (i.e., μ1 = μ2); means are equal | H1: μ2 - μ1 ≠ 0; means are not equal | Simple / Composite | Nondirectional (two-sided)
H0: μ2 - μ1 = 0 (i.e., μ1 = μ2); means are equal | H1: μ2 - μ1 > 0; mean 2 is greater than mean 1 | Simple / Composite | Directional
H0: μ2 - μ1 = 0 (i.e., μ1 = μ2); means are equal | H1: μ2 - μ1 < 0; mean 2 is smaller than mean 1 | Simple / Composite | Directional

^a A simple hypothesis involves a single value of the parameter, and a composite hypothesis involves more than one value (e.g., an interval) of the parameter
|
||
|
||
The numerator is the deviation of the observed sample mean difference (i.e., d = x2 - x1) from the null value of the population mean difference (i.e., δ = μ2 - μ1 = 0). Note that δ is set to 0 for the equality test. The denominator, SE(d), is the estimated standard error of the numerator. The simplified form of this test statistic is t = d/SE(d). Note that a large value of this ratio (positive, negative, or either direction depending on the alternative hypothesis) indicates that the observed deviation of d from the null value of δ = 0 may not be due to chance alone. According to the sampling theory, under the null hypothesis, if the two population variances are equal, this test statistic will follow the t-distribution with df = (n1 - 1) + (n2 - 1), where n1 and n2 are the sample sizes of the two groups, respectively. We can then find out how extreme (i.e., unlikely to observe) the observed test statistic t is, if it results from sample data gathered from the two population distributions under the null hypothesis.
The derivation of the standard error, SE(d), may seem complex for beginners. As this book intends not to let the readers plug and play formulae, the conceptual steps of its derivation are shown below. The definition of the standard error is the standard deviation of the sampling distribution of a sample statistic (see Sect. 2.1.3 for its definition). The sample statistic in this case is d, the difference between the two observed sample means, i.e., x2 - x1, so the standard deviation of the sampling distribution of x2 - x1, SE(d), is to be derived. Because the standard deviation is the square root of the variance, the main task is to derive the variance. Note that the sampling variances of x1 and x2 are s1²/n1 and s2²/n2, respectively, where s1² and s2² denote the variances of the sample distributions (see Sect. 2.1.2 for the definition of the sample distribution). Because x1 and x2 each vary from sample to sample with variances s1²/n1 and s2²/n2, their difference x2 - x1 varies even more (see Example 3.2), and the variance of the difference is the sum of the individual variances, s1²/n1 + s2²/n2. Finally, the denominator of the test statistic is the square root of this sum of the two sampling variances, i.e., √(s1²/n1 + s2²/n2).

Having derived the test statistic, the inference of Example 3.1 continues as given below.
The test statistic t = (7.08 - 6.26) / √(0.232² + 0.248²) = 0.82/0.34 = 2.41 will follow the t-distribution with df = 28. The critical value of this directional test for a 5% significance level is 1.701, and the critical region is {t > 1.701}. The observed test statistic from the data, t = 2.41, falls into the critical region (Fig. 3.3).

The p-value can also be calculated using the Excel function TDIST(2.41, 28, 1), which gives the result 0.0113.

Summary: These data showed that there was a significant effect of medication X in preventing low-birthweight delivery (p = 0.0113).
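The arithmetic above can be reproduced in a few lines. Below is a minimal Python sketch, not the book's own code: scipy is assumed to be available for the t-distribution tail (playing the role of Excel's TDIST), and the per-group sample size of 15 is an assumption chosen only to be consistent with df = 28; the means and standard errors are those quoted for Example 3.1.

```python
import math
from scipy import stats

# Summary statistics quoted for Example 3.1 (placebo vs. medication X)
mean1, se1 = 6.26, 0.232   # placebo group mean and its standard error
mean2, se2 = 7.08, 0.248   # treated group mean and its standard error
n1 = n2 = 15               # assumed group sizes, consistent with df = 28

d = mean2 - mean1                      # observed mean difference
se_d = math.sqrt(se1**2 + se2**2)      # SE(d): sampling variances of the means add
t = d / se_d                           # trio: (Observed - Null) / SE, null value 0
df = (n1 - 1) + (n2 - 1)
p_one_sided = stats.t.sf(t, df)        # directional p-value, as TDIST(t, 28, 1)

print(round(t, 2), df, round(p_one_sided, 4))
```

Since t = 2.41 exceeds the critical value 1.701, the one-sided p-value lands below 0.05, matching the conclusion reached in the text.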
Fig. 3.3 Illustration of the critical region of the directional t-test (applied to Example 3.1)
3.1.1 Independent Samples t-Test When Variances Are Unequal
When the two population variances are not equal, the sampling distribution of the aforementioned test statistic will not perfectly follow the t-distribution with df = n1 + n2 - 2. This book does not describe the computational details but rather discusses how we can diagnose such a phenomenon and how we can make inferences and interpret the results, provided that the intermediate computational results for the diagnosis and the calculated SE are produced by computer programs or expert statisticians. If heteroskedasticity is detected, then we resort to a theoretical t-distribution with a slightly modified df. In this case, the modified df becomes a little smaller than (n1 + n2 - 2), which is the df obtained under the homoskedasticity (i.e., equal variance) assumption.
How can we diagnose whether the variances are equal before carrying out the independent samples t-test? There is a test that evaluates it. The null and alternative hypotheses of this test are H0: σ1² = σ2² and H1: σ1² ≠ σ2² (note that the nondirectional alternative hypothesis is its obvious choice). Compute the two sample variances and take the ratio of the two. This ratio will follow an F-distribution under the null hypothesis. The F-distribution has two dfs, where df1 = n1 - 1 and df2 = n2 - 1. Then we resort to the F-distribution to determine the p-value (details for computing a p-value from an F-distribution will be shown in Chap. 4). A p-value < 0.05 indicates unequal variances, and this condition would need to be taken into account when carrying out the t-test.
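Excel's FTEST (Fig. 3.9) automates this diagnosis; a sketch of the same variance-ratio test in Python follows (scipy is assumed to be available, and the two samples are hypothetical data made up for illustration):

```python
from scipy import stats

def variance_ratio_test(sample1, sample2):
    """Two-sided F-test of H0: sigma1^2 = sigma2^2 (larger variance on top)."""
    n1, n2 = len(sample1), len(sample2)
    v1 = stats.tvar(sample1)   # sample variance with n - 1 divisor
    v2 = stats.tvar(sample2)
    if v1 >= v2:
        f, df1, df2 = v1 / v2, n1 - 1, n2 - 1
    else:
        f, df1, df2 = v2 / v1, n2 - 1, n1 - 1
    p = 2 * stats.f.sf(f, df1, df2)    # nondirectional p-value
    return f, min(p, 1.0)

# Hypothetical samples: the second is visibly more dispersed than the first
f, p = variance_ratio_test([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(f, p)
```

With only five observations per group the ratio of 4.0 is not significant; the p-value guides whether to use the equal-variance or the modified (unequal-variance) t-test.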
If the data show that the variances are unequal, then a modified version of the independent samples t-test is recommended. The main work is to adjust the df as well as to estimate the SE of the mean difference, which could become biased if not well taken care of. This monograph does not introduce the mathematical/computational details, but the following is the idea of the df adjustment. As the df becomes smaller, the tail of the t-distribution becomes thicker, meaning that if the df is adjusted downward for heteroskedasticity, then the observed t-statistic will produce a larger p-value than would have resulted from the t-distribution with the unadjusted df because the modified distribution has a heavier tail; therefore, the test becomes conservative.
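A commonly used version of this downward adjustment is the Welch-Satterthwaite approximation (the book does not give the formula; the sketch below is one standard way to compute it):

```python
def welch_df(s1_sq, n1, s2_sq, n2):
    """Welch-Satterthwaite approximate df for unequal variances.

    s1_sq and s2_sq are the sample variances; the result is at most
    n1 + n2 - 2 and shrinks as the variances grow apart.
    """
    a, b = s1_sq / n1, s2_sq / n2
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

# Equal variances and equal sizes recover the usual df = n1 + n2 - 2
print(welch_df(1.0, 10, 1.0, 10))    # 18.0
# Strongly unequal variances pull the df down toward n1 - 1
print(welch_df(25.0, 10, 1.0, 10))
```

The smaller df thickens the reference distribution's tails, which is exactly the conservatism described above.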
3.1.2 Denominator Formulae of the Test Statistic for Independent Samples t-Test
Other books offer plug-and-play formulae for calculating the denominator of the test statistic, which prompt you to plug in the sample sizes and the sample variances of the two groups being compared. The readers do not need to master the computational details of the standard error derivation; nevertheless, they are shown in Table 3.2.
Table 3.2 Derivation of the denominator of the test statistic for independent samples t-test

Sample size  Population variances  Derived standard error (i.e., denominator of the test statistic t)
Unequal      Unequal               √(s1²/n1 + s2²/n2)
Equal        Unequal               Substitute n for n1 and n2, so √(s1²/n + s2²/n) = √((s1² + s2²)/n)
Unequal      Equal                 Substitute sp²** for s1² and s2², so √(sp²/n1 + sp²/n2) = √(sp²(1/n1 + 1/n2))
Equal        Equal                 Substitute n for n1 and n2, and sp²** for s1² and s2², so √(sp²/n + sp²/n) = √(2sp²/n)

** sp² = [(n1 - 1)s1² + (n2 - 1)s2²]/(n1 - 1 + n2 - 1). This weighted average is called the pooled sample variance.
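The four cases in Table 3.2 collapse into two small helpers; a sketch in pure Python (function names are mine, not the book's):

```python
import math

def pooled_variance(s1_sq, n1, s2_sq, n2):
    # Weighted average of the two sample variances (Table 3.2 footnote)
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 - 1 + n2 - 1)

def se_mean_difference(s1_sq, n1, s2_sq, n2, equal_variances=True):
    if equal_variances:
        sp_sq = pooled_variance(s1_sq, n1, s2_sq, n2)
        return math.sqrt(sp_sq * (1 / n1 + 1 / n2))
    # Unequal population variances: no pooling
    return math.sqrt(s1_sq / n1 + s2_sq / n2)

# With equal sample sizes the pooled and unpooled SEs coincide,
# which is why the last two rows of Table 3.2 simplify so neatly
print(se_mean_difference(4.0, 10, 9.0, 10, equal_variances=True))
print(se_mean_difference(4.0, 10, 9.0, 10, equal_variances=False))
```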
3.1.3 Connection to the Confidence Interval
Having concluded a test of hypothesis merely to claim whether the two independent means are equal, the researchers may further be interested in estimating the size of the mean difference and its confidence interval. The following numerical illustration constructs a 95% confidence interval for the mean difference using Example 3.1.

The observed mean difference (i.e., the point estimate of the mean difference between the groups with treatment and placebo) = 7.08 - 6.26 = 0.82, and the standard error of that observed mean difference = 0.339. The 2.5th and 97.5th percentiles of t with df = 28 are found using Excel, i.e., TINV(0.05, 28) = 2.048 (or using a table of the percentiles of the t-distribution). The lower limit of the 95% confidence interval (CI) based on the t-distribution with df = 28 is 0.82 - 2.048 × 0.339 = 0.126, and its upper limit is 0.82 + 2.048 × 0.339 = 1.514; thus the 95% confidence interval is 0.126-1.514. Note that although the test was directional, it is a tradition to construct the 95% CI nondirectionally, disregarding the directionality of the hypothesis test.

Summary: These data showed that the mean of the treated group was greater by 0.82 (95% CI: 0.126-1.514).
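The interval arithmetic above can be sketched as follows; scipy's t.ppf plays the role of Excel's TINV here (scipy is an assumed dependency), and the point estimate and SE are those of Example 3.1:

```python
from scipy import stats

d = 0.82          # observed mean difference (Example 3.1)
se_d = 0.339      # its standard error
df = 28

t_crit = stats.t.ppf(0.975, df)          # 97.5th percentile = TINV(0.05, 28)
lower = d - t_crit * se_d
upper = d + t_crit * se_d
print(round(t_crit, 3), round(lower, 3), round(upper, 3))
```

Because the interval excludes 0, it agrees with the rejection of H0: δ = 0.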
3.2 Paired Sample t-Test for Comparing Paired Means
Pairing helps a study design control the subject-to-subject outcome variation: the responses to a study medication may vary among different subjects, and testing the average of the within-subject longitudinal changes in response eliminates this between-subject source of variation. The following example illustrates the paired sample t-test applied to such a clinical investigation (Fig. 3.4).
Fig. 3.4 Illustration of distributions of paired normally distributed continuous outcomes and their paired differences
Example 3.2 Does the use of oral contraceptives (OC) affect systolic blood pressure (SBP)? A self-control (pre-measurement and post-measurement pairs) design is applied, and the following test for H0: μ2 - μ1 = 0 and H1: μ2 - μ1 ≠ 0 is considered.

The following data are the systolic blood pressure measurements in mmHg from 10 study volunteers before using OC (i.e., baseline) and those taken from the same 10 women after the use of OC for a certain period (i.e., follow-up). Let us assume that the distribution of the SBP values in the population is a Gaussian distribution.

Subject         1   2   3   4   5   6   7   8   9  10
Baseline SBP  115 112 107 119 115 138 126 105 104 115
Follow-up SBP 128 115 106 128 122 145 132 109 102 117
The following preliminary data analysis (i.e., data reduction) was performed and summarized in Table 3.3. Do these summary statistics describe the phenomenon clearly?
While the summary statistics in Table 3.3 and the graphical display of the data depicted by Fig. 3.5 show that the mean SBP at follow-up was slightly elevated and the variability was also slightly increased, the paired nature of the design was not reflected. In contrast, Fig. 3.6 illustrates a typical graphical description of paired data: most subjects showed an increase in the follow-up SBP, and the second panel shows that the between-subject range of blood pressure was larger than the average within-subject change.
Table 3.3 Sample statistics of the SBP at baseline and follow-up
               n   Mean    Median  Standard deviation  Standard error of the mean
Baseline SBP   10  115.60  115.00  10.31               3.26
Follow-up SBP  10  120.40  119.50  13.22               4.18
Fig. 3.5 Illustration of inappropriate box-and-whisker plot for paired continuous outcomes
Fig. 3.6 Illustration of properly displayed paired continuous outcomes
A significance level of 5% is adopted. The test is derived based on the Observed ~ Null Value ~ SE trio. To complete this task, let us revisit the articulated hypotheses. By letting δ denote μ2 - μ1, the null and alternative hypotheses are rewritten as H0: δ = 0 and H1: δ ≠ 0. The observed estimate of δ, d = 4.80, is the sample mean of the 10 changes in SBP between the baseline and follow-up, and the null value of δ is 0. The SE(d - δ) = SE(d) is directly estimated from the 10 within-subject longitudinal changes. The resulting test statistic is the same as that of the one-sample t-test, as the within-subject change is treated as the unit of analysis, which is depicted by Figs. 3.7 and 3.8. The test statistic t = (d - δ)/SE(d - δ) = d/SE(d) follows the t-distribution with df = number of pairs - 1 = 10 - 1 = 9.

Fig. 3.7 Illustration of calculated paired differences that can be analyzed by a one-sample t-test

Fig. 3.8 Box-and-whisker plot applied to describe the distribution of differences calculated from pairs of continuous outcomes
The resulting t = (4.8 - 0)/1.44 = 3.324 and the p-value is 0.0089, which indicates that the mean change of 4.8 was significant at the 5% significance level.
Because the baseline and follow-up data (i.e., two outcome variables) are reduced to the pairwise within-subject differences (i.e., one transformed outcome variable), the paired sample t-test is the same as a one-sample t-test, and its df is the number of unique individuals - 1.
A common misapplication of the t-test to paired data is to apply the independent samples t-test. As illustrated below, such an erroneous application would mislead the study investigation. If the equal-variance independent samples t-test (see Table 3.2 in Sect. 3.1.2) had been applied to the above example, then the test statistic's denominator, SE(x2 - x1), where x1 and x2 are the baseline and follow-up mean SBPs, respectively, turns out to be √(s1²/n + s2²/n) = √(2sp²/n), where sp² = [(n1 - 1)s1² + (n2 - 1)s2²]/(n1 - 1 + n2 - 1), n1 = n2 = 10, s1² = 10.31², and s2² = 13.22². The calculated value of this standard error is 5.3028. The test statistic is then t = (4.8 - 0)/5.3028 = 0.91 with df = 18, and the p-value is 0.3773, which contradicts the result from the paired sample t-test.
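The contrast between the correct paired analysis and the erroneous independent-samples analysis can be reproduced directly from the Example 3.2 data; a sketch using scipy (assumed to be available), with subject 4's follow-up SBP taken as 128, the value consistent with the reported follow-up mean of 120.40:

```python
from scipy import stats

baseline = [115, 112, 107, 119, 115, 138, 126, 105, 104, 115]
followup = [128, 115, 106, 128, 122, 145, 132, 109, 102, 117]

# Correct analysis: paired sample t-test (df = 10 - 1 = 9)
t_paired, p_paired = stats.ttest_rel(followup, baseline)

# Misapplication: independent samples t-test (df = 18), ignoring the pairing
t_indep, p_indep = stats.ttest_ind(followup, baseline)

print(round(t_paired, 3), round(p_paired, 4))   # significant
print(round(t_indep, 2), round(p_indep, 4))     # not significant
```

The paired test isolates the within-subject changes, so its SE (1.44) is far smaller than the between-subject SE (5.30) used by the independent test, and the conclusions diverge accordingly.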
3.3 Use of Excel for t-Tests
The one-sample t-test, independent samples t-test, and paired sample t-test can easily be carried out using Excel (Figs. 3.9, 3.10, and 3.11).

Fig. 3.9 Use of Excel function FTEST for testing equality of two variances

Fig. 3.10 Use of Excel function for independent samples t-tests
Study Questions
1. Explain the idea of the Observed ~ Expected ~ SE trio formulation of the test statistic t of the independent samples t-test.
2. Can the paired sample t-test be conceived as a one-sample t-test? Why?
3. What is the idea behind adjusting (i.e., decreasing) the degrees of freedom for the independent samples t-test when the variances of the two populations are not equal?
4. Please criticize the following awkward sentences:
   "These data showed that the two groups were significantly different (p < 0.05)."
   "These data showed that the two group-specific means were significantly different."
   "The calculated p-value was less than 0.05, thus we rejected the null hypothesis."
Fig. 3.11 Use of Excel function for paired sample t-test
Bibliography
Dawson, Beth; Trapp, Robert G. (1994). Basic & Clinical Biostatistics (4th Edition). Appleton & Lange.
Glantz, Stanton A. (2005). Primer of Biostatistics (6th Edition). McGraw Hill Professional.
Pagano, Marcello; Gauvreau, Kimberlee (1993). Principles of Biostatistics. Duxbury Press.
Rosner, Bernard (2010). Fundamentals of Biostatistics (7th Edition). Cengage Learning, Inc.
Snedecor, George W.; Cochran, William G. (1991). Statistical Methods (8th Edition). Wiley-Blackwell.
Zar, Jerrold H. (2010). Biostatistical Analysis (5th Edition). Prentice-Hall/Pearson.
Chapter 4
Inference Using Analysis of Variance (ANOVA) for Comparing Multiple Means
This chapter discusses single-factor analysis of variance (ANOVA), which is mainly applied to compare three or more independent means. The term "single factor" refers to the fact that the means are compared across the levels of a single classification variable (i.e., classification of means by a single categorical variable). The classification variable is called the independent variable or factor (thus the method is also called single-factor ANOVA), and the outcome variable whose means are compared is called the dependent variable. This method requires certain assumptions: (1) the dependent variable values are observations sampled from a normal distribution, and (2) the population variances are equal (homoskedasticity) across the levels of the independent variable (Fig. 4.1).
Having mentioned that ANOVA is mainly applied to compare three or more means, is it obvious why the variance is analyzed to compare the means? ANOVA is a two-step procedure. The first step is to measure two partitioned amounts of outcome data variation arising from two sources: the first is the variation of the outcome variable explained by the groups being compared, and the second is the unexplained residual (error) variation. The next step is to utilize these two amounts of data variation to carry out the hypothesis testing for comparing the means. These amounts of data variation are measured by means of sums of squares (see Sect. 1.3.4).
4.1 Sums of Squares and Variances
Fig. 4.1 Illustration of homoskedastic density curves of three continuous outcomes with unequal means

Fig. 4.2 Illustrated deviations arising from comparing three independent means

In Fig. 4.2, three groups of sample data that are clearly separated (for illustrative purposes) by the underlying group effects are illustrated, and the symmetrically scattered outcomes around their group means reveal the random sampling error. ANOVA examines whether the group effect, depicted by the distances among the three group means, can separate the group-wise data clusters in the presence of the random sampling error. The group effect (i.e., signal) and the random sampling error (i.e., noise) are first measured by two kinds of sums of squares and then transformed into kinds of averaged sums of squares (so-called mean squares).
Figure 4.3 demonstrates the concepts of derivation of these sums of squares (SS) that are the underpinnings of ANOVA. For instance, the first observed outcome value 5 (Sample Number 1 in Group 1) is conceived as a value deviated by -15 from the single grand mean of 20; this outcome is also conceived as a value deviated by -5 from its group mean 10, wherein this group mean is deviated by -10 from the grand mean 20. The three corresponding squared deviations are (-15)² = 225, (-5)² = 25, and (-10)² = 100. Repeating the same calculation over every observation within each group, cumulating the resulting individual squared deviations into group sum totals, and then summing the group sum totals over the three groups finally produces three kinds of sums of squares. These are called the total, between-group, and within-group sums of squares, and the resulting values are 950, 800, and 150, respectively. Intuitively, the total sum of squares of 950 is partitioned into the between-group sum of squares of 800 and the within-group sum of squares of 150 (Fig. 4.4).
Dividing a sum of squares by a divisor (i.e., degrees of freedom, see Sect. 2.2.3) yields a kind of variance. In ANOVA, two such variances are compared: the variance due to the group difference and the variance due to the sampling error (i.e., the residual variance left unexplained by the systematic group difference). To distinguish them from the ordinary variance, these variances obtained in ANOVA are termed "mean squares" (MS). The two major mean squares in single-factor ANOVA are the between-group mean square (MSbetween) and the within-group mean square (MSwithin). MSbetween is the sum of the squared deviations of the group means from the grand mean, counted for all data values, divided by its divisor, where the divisor is one less than the number of unique deviations of the group means from the referenced overall grand mean. Note that, as demonstrated in Fig. 4.3, there are only three unique deviations of the group-specific means from the grand mean, and MSbetween becomes (400 + 0 + 400)/(3 - 1) = 800/2 = 400. MSwithin is (50 + 50 + 50)/[(number of unique deviations of the individual data points from the Group 1 mean - 1) + (number of unique deviations of the individual data points from the Group 2 mean - 1) + (number of unique deviations of the individual data points from the Group 3 mean - 1)] = 150/[(4 - 1) + (4 - 1) + (4 - 1)] = 150/9 = 16.67.

Fig. 4.3 Formulae-less numerical illustration of sums of squares in single-factor ANOVA

Fig. 4.4 Partitioning the sum of squares for single-factor ANOVA
4.2 F-Test
With the calculated between-group and within-group mean squares, the hypothesis test to compare means can begin. The null and alternative hypotheses are H0: μ1 = μ2 = μ3 = . . . = μk (all means are equal) and H1: at least one mean is different (not all means are equal). The test statistic is derived under the assumption that the data are drawn from normally distributed populations and that the population variances are equal across the groups. Unlike the t-tests, the test statistic of this test is not derived from the "trio" (see Sects. 2.2.4.5 and 3.1). Instead, the ratio of the two mean squares (see Sect. 4.1) is used as the test statistic. By letting F denote this ratio, F = MSbetween/MSwithin is the signal (between-group difference)-to-noise (random sampling error) ratio test statistic. Under the null hypothesis, this test statistic follows an F-distribution, which is characterized by two degrees of freedom, dfbetween and dfwithin. Each df is the divisor that was used for calculating the corresponding mean square, and each divisor is one less than the number of unique squared deviations from the referenced mean that are summed in the SS calculation.
F = 400/16.67 = 23.99 is the specific value of the test statistic from the observed data, which under the null hypothesis follows the F-distribution with dfbetween = 2 and dfwithin = 9. The method to identify the rejection region using the F-distribution table is illustrated graphically in Fig. 4.5. Evidently, such a large value as 23.99 fell into the rejection region (i.e., F > 4.26); thus, these data showed evidence to reject H0 at the 5% significance level.
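The partitioning of Figs. 4.3 and 4.4 and the F-ratio above can be verified numerically. The three groups below are a hypothetical data set consistent with the illustrated totals (group means 10, 20, 30 with n = 4 each, first value 5); scipy is assumed for the critical value:

```python
from scipy import stats

groups = [
    [5, 10, 10, 15],    # mean 10
    [15, 20, 20, 25],   # mean 20
    [25, 30, 30, 35],   # mean 30
]
all_values = [x for g in groups for x in g]
grand_mean = sum(all_values) / len(all_values)                      # 20

ss_total = sum((x - grand_mean) ** 2 for x in all_values)           # 950
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                 for g in groups)                                   # 800
ss_within = ss_total - ss_between                                   # 150

df_between = len(groups) - 1                                        # 2
df_within = len(all_values) - len(groups)                           # 9
f = (ss_between / df_between) / (ss_within / df_within)             # ~24

f_crit = stats.f.ppf(0.95, df_between, df_within)                   # ~4.26
print(ss_total, ss_between, ss_within, round(f, 2), round(f_crit, 2))
```

Note that the subtraction ss_total - ss_between is exactly the partitioning identity of Fig. 4.4.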
Fig. 4.5 Determination of the rejection region of the F-distribution for an ANOVA F-test for comparing three means

Example 4.1 A cross-sectional study design is applied to examine whether a mother's smoking affected the offspring's birth weight. The null and alternative hypotheses for the inference are

H0: μ1 = μ2 = μ3 = μ4 (all four means are equal) and H1: at least one mean is different (not all means are equal).

The following data are birth weights (lb) of 27 newborns classified by their maternal smoking status (i.e., one-way classification). Data normality is assumed.
Fig. 4.6 Box-and-whisker plot of four distributions whose mean are compared by single-factor ANOVA F-test
Table 4.1 ANOVA table

Source of variation   SS     df   MS    F     p-value
Between group         11.74   3   3.91  4.20  0.017
Within group          21.28  23   0.93
Total                 33.02  26
Group 1 - Mother is a nonsmoker (n = 7): 7.5 6.2 6.9 7.4 9.2 8.3 7.6
Group 2 - Mother is an ex-smoker (n = 5): 5.8 7.3 8.2 7.1 7.8
Group 3 - Mother smokes <1 pack/day (n = 7): 5.9 6.2 5.8 4.7 8.3 7.2 6.2
Group 4 - Mother smokes ≥1 pack/day (n = 8): 6.2 6.8 5.7 4.9 6.2 7.1 5.8 5.4
(Fig. 4.6)
The means and standard deviations are obtained as the descriptive summary statistics. Besides these descriptive statistics, a summary table (the so-called ANOVA table) traditionally presents the sum of squares and mean square for each source of variation as well as the test statistic F and its p-value (see Table 4.1).
Computation of the sums of squares and mean squares can be done either by a computer package program or manually. As this chapter does not offer directly usable computational formulae, this illustration is made solely to walk you through the essential computations of the variance partitioning and the derivation of the test statistic.

The first quantity to calculate is how much the group means deviate from the grand mean, where the grand mean, xG, is the weighted average of the four group means with the group sample sizes as weights, i.e., xG = (7 × 7.59 + 5 × 7.24 + 7 × 6.33 + 8 × 6.01)/27 = 6.73.
SSbetween is the sum of seven copies of Group 1's squared mean deviation from the grand mean, i.e., 7 × (x1 - xG)², plus five such squared values from Group 2, i.e., 5 × (x2 - xG)², plus 7 × (x3 - xG)² from Group 3 and 8 × (x4 - xG)² from Group 4, which is 7 × (7.59 - 6.73)² + 5 × (7.24 - 6.73)² + 7 × (6.33 - 6.73)² + 8 × (6.01 - 6.73)² = 11.74.
SSwithin is the sum of the four within-group sums of squares, each of which is obtainable from the already calculated within-group standard deviation. Because the variance within the kth subgroup, sk², is SSk,within/(nk - 1), where k = 1, 2, 3, and 4 indicates the group, the sum of squared deviations within the kth subgroup is SSk,within = sk² × (nk - 1). Therefore, SSwithin from all four groups = (7 - 1) × 0.96² + (5 - 1) × 0.91² + (7 - 1) × 1.14² + (8 - 1) × 0.72² = 21.28.
Finally, F = MSbetween/MSwithin = [11.74/(4 - 1)]/[21.28/((7 - 1) + (5 - 1) + (7 - 1) + (8 - 1))] = (11.74/3)/(21.28/23) = 3.91/0.93 = 4.20. The p-value was then directly evaluated using Excel (not by determining the critical region and seeing whether the observed F value fell into it), i.e., p = FDIST(4.20, dfbetween = 3, dfwithin = 23) = 0.017. Figure 4.7 is the graphical demonstration of this calculation, wherein the density function is an F-distribution characterized by the two required degrees of freedom, 3 and 23, and it appears different from the one with degrees of freedom 2 and 9 (Fig. 4.5). Note that the shape of the density curve of an F-distribution is characterized by its degrees of freedom (i.e., not all F-distributions look alike).

Although the p-value was directly calculated in this example, the significance can also be determined by checking whether the test statistic fell into the critical region. The critical region of this test is F > 3.03, which is found from the critical value table of the F-distribution at the 5% significance level with between- and within-group degrees of freedom of 3 and 23.

The suggested format of the summary sentence is either "These data showed that at least one smoking group's mean birth weight is significantly different from the means of the other groups (p = 0.017)" or "These data showed that at least one smoking group's mean birth weight is significantly different from the means of the other groups at the 5% significance level."
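The last step, from the sums of squares in Table 4.1 to F and its p-value, can be sketched as follows. scipy's f.sf plays the role of Excel's FDIST (scipy is an assumed dependency), and because the SS values are the rounded ones reported in the table, the resulting F differs slightly from the doubly rounded 4.20 in the text:

```python
from scipy import stats

ss_between, df_between = 11.74, 3   # from Table 4.1
ss_within, df_within = 21.28, 23

ms_between = ss_between / df_between
ms_within = ss_within / df_within
f = ms_between / ms_within
p = stats.f.sf(f, df_between, df_within)   # = FDIST(f, 3, 23)

print(round(f, 2), round(p, 3))
```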
Fig. 4.7 p-value calculation of the F-statistic obtained from an ANOVA F-test comparing three means using the data set illustrated in Fig. 4.6
4.3 Multiple Comparisons and Increased Chance of Type 1 Error
Suppose H0: μ1 = μ2 = μ3 = μ4 was rejected in favor of H1: at least one mean differs from the others by a one-way ANOVA F-test. It remains ambiguous which specific means differed until all possible pairwise tests comparing two groups at a time (six independent samples t-tests in this case) are performed. In doing so, the increased number of tests increases the chance of a Type 1 error. To protect against this increased chance of committing a Type 1 error, a more stringent criterion (i.e., a modified significance level) for these tests needs to be adopted. One option is to lower the significance level of each test by dividing it by the number of comparisons (Bonferroni's correction, i.e., adjusted α = 0.05/6 = 0.0083), or, equivalently, to inflate the calculated p-value by multiplying it by the number of comparisons (i.e., inflated p-value = 6 × observed p-value). However, this tends to be too conservative as the number of tests increases. Many "not-too-conservative but powerful" tests have been invented. The least significant difference (LSD), honestly significant difference (HSD), Student-Newman-Keuls (SNK), and Duncan's multiple range test procedures are popular procedures applying pairwise multiple tests for comparing means, and Dunnett's procedure compares groups with a baseline (control) group using modified critical values of the test statistic.
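Bonferroni's correction is a one-liner in both directions (adjusting α or inflating p); a sketch, with the four groups of Example 4.1:

```python
from math import comb

k = 4                          # number of groups being compared (Example 4.1)
m = comb(k, 2)                 # number of pairwise comparisons: 6
alpha = 0.05

adjusted_alpha = alpha / m     # per-test significance level, 0.05/6 = 0.0083
print(m, round(adjusted_alpha, 4))

# Equivalently, inflate each observed p-value (capped at 1)
def bonferroni_p(p_observed, n_tests=m):
    return min(1.0, n_tests * p_observed)

print(bonferroni_p(0.01))      # exceeds 0.05: no longer significant
```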
4.4 Beyond Single-Factor ANOVA

4.4.1 Multi-factor ANOVA
As the number of categorical independent variables increases and the outcomes are classified by these independent variables, the ANOVA F-tests will involve partitioning the total variance into the between-group variances due to the effects of the individual independent variables, the between-group variances due to the interactions of two or more independent variables, and the remaining variance that is not explained by the sources of variation already taken into account. The testing procedure for comparing subgroup means (e.g., the difference of the means among the levels of the first independent variable, that among the levels of the second independent variable, and that among the levels arising from the interaction of the first and second independent variables) is an F-test whose test statistic's numerator is the mean square due to the between-group effect of interest and whose denominator is the unexplained error mean square. With a firm understanding of single-factor ANOVA, extending the method to multi-factor ANOVA can be a trivial calculation problem. However, it is worth addressing the definition and interpretation of the two topics discussed in Sects. 4.4.2 and 4.4.3.
4.4.2 Interaction
The following example illustrates the definition of interaction as well as the marginal means, main effects, and simple means arising in two-factor ANOVA. The example is self-explanatory and does not necessitate verbal definitions.
Example 4.2 A study of abdominal fat reduction (measured in % reduction) after 8-week programs of exercise alone or diet and exercise was conducted involving two age groups: age < 50 years and age ≥ 50 years. The sample size of each subgroup was 2000. Let us assume that every observed difference was statistically significant (Table 4.2).
Table 4.2 Means of % fat reduction by age group and program

                              Age < 50                       Age ≥ 50
Program 1: Exercise alone     n = 2000, mean reduction = 5%  n = 2000, mean reduction = 3%
Program 2: Exercise and diet  n = 2000, mean reduction = 8%  n = 2000, mean reduction = 4%
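In Table 4.2 the program effect is not the same in the two age groups, which is precisely an interaction: a difference of the simple (within-age-group) effects. A quick arithmetic sketch of the cell means:

```python
# Cell means of % fat reduction from Table 4.2
means = {
    ("age<50", "exercise"): 5.0,
    ("age<50", "exercise+diet"): 8.0,
    ("age>=50", "exercise"): 3.0,
    ("age>=50", "exercise+diet"): 4.0,
}

# Simple effects: the program effect within each age group
effect_young = means[("age<50", "exercise+diet")] - means[("age<50", "exercise")]   # 3.0
effect_old = means[("age>=50", "exercise+diet")] - means[("age>=50", "exercise")]   # 1.0

# Interaction: the program effect differs between the age groups
interaction = effect_young - effect_old                                             # 2.0

# Marginal (main-effect) means for program, averaging over the equally
# sized age groups
marginal_exercise = (means[("age<50", "exercise")]
                     + means[("age>=50", "exercise")]) / 2                          # 4.0
marginal_both = (means[("age<50", "exercise+diet")]
                 + means[("age>=50", "exercise+diet")]) / 2                         # 6.0
print(effect_young, effect_old, interaction, marginal_exercise, marginal_both)
```

A nonzero interaction warns that the marginal means alone (4% vs. 6%) understate the program benefit for the younger group and overstate it for the older group.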