2018 IEEE International Symposium on Software Reliability Engineering Workshops

Using Supervised Learning to Guide the Selection of Software Inspectors in Industry

1Maninder Singh, Department of Computer Science, North Dakota State University, Fargo, USA, maninder.singh@ndsu.edu
2Gursimran Singh Walia, Department of Computer Science, North Dakota State University, Fargo, USA, gursimran.walia@ndsu.edu
3Anurag Goswami, Department of Computer Science, Bennett University, Greater Noida, India, anurag.goswami@bennett.edu.in
Abstract--Software development is a multi-phase process that starts with requirements engineering. Requirements elicited from different stakeholders are documented in a natural language (NL) software requirement specification (SRS) document. Due to the inherent ambiguity of NL, the SRS is prone to faults (e.g., ambiguity, incorrectness, inconsistency). To find and fix faults early (where they are cheapest to find and fix), companies routinely employ inspections, in which skilled inspectors are selected to review the SRS and log faults. Other researchers have attempted to understand the factors (experience and learning styles) that can guide the selection of effective inspectors, but could not report improved results. This study analyzes the reading patterns (RPs) of inspectors recorded by eye-tracking equipment and evaluates their abilities to find various fault-types. The inspectors' characteristics are selected by employing ML algorithms to find the most common RPs w.r.t. each fault-type. Our results show that our approach could guide inspector selection with an accuracy ranging between 79.3% and 94% for various fault-types.
Keywords—Fault types, classifiers, eye tracking, reading patterns, inspector selection, machine learning
I. INTRODUCTION
Leading software companies employ inspections (defined by Fagan [1]) to find and fix faults early and avoid costly rework later. Multiple studies estimate that the cost savings of finding faults through early inspections (especially of requirements, where faults are cheapest to find and fix) versus testing can be as high as 17:1 in work hours [2]. While inspections are useful, their effectiveness relies on the selection of skilled inspectors. Researchers and practitioners have tried to use background information (experience, education, personality, etc.) to predict the performance of individual inspectors, but have not been successful [3]. In fact, results at Microsoft and other major software companies showed that the most skilled software inspectors had less experience and non-technical backgrounds [3].
Motivated by these findings, Goswami et al. [4] leveraged research from psychology to show that Learning Styles (LS) can be used to select a team of inspectors. Goswami et al. conducted an industrial empirical study and reported that selecting inspectors with the most dissimilar LSs can result in improved fault coverage. This finding was consistent with the results from a large-scale study at Microsoft [3], where managers tended to include inspectors from varied backgrounds, especially those with non-computer backgrounds. While LSs seemed useful in selecting inspection teams, Goswami et al. [2], [4], [5] were not able to find common LSs that were positively correlated with individual inspection performance across studies. One of the reasons for this is that SRS documents are generally developed in NL and are not tailored to the LSs of specific readers. One of the major results from these studies was that individual inspectors (even within the same LS category) exhibit different reading patterns (RPs) depending on the type of SRS being inspected, which in turn impacts their ability to report faults. While LS is an abstract model for capturing the RPs of inspectors, more objective means of capturing RPs would help project managers identify skilled inspectors. We believe that inspectors' RPs are generalizable across SRS documents.
Additionally, past research has identified that software organizations need inspectors that can detect specific fault types (e.g., Ambiguous - A, Inconsistent Information - II, Omission - O, Incorrect Fact - IF) at a higher rate. Therefore, this research tries to characterize the RPs of inspectors and their capability at detecting various requirement fault types.
To characterize RPs, an eye-tracking apparatus was used in a controlled environment wherein software engineers (with industry experience) reviewed an industrial-strength NL requirement document and reported faults. We collected several metrics to examine the RPs of inspectors (e.g., eye movements), their cognitive processing, and their fault detection abilities across different areas of the SRS (see Figure 1). The following eye-tracking metrics were collected and are analyzed in this research:
• Fixation: a point where the eyes are relatively stationary and the individual is taking in information.
• Saccade: a quick eye movement between fixations.
• Scan path: a complete saccade-fixation-saccade sequence, including the interconnecting saccades.
• Gaze: the sum of fixation durations in an area; also known as "dwell", "fixation cluster", or "fixation cycle".
• Region of Interest (ROI): an analysis method in which the eye movements that fall within a certain area are evaluated.

Next, to better analyze the RPs of inspectors and predict their inspection effectiveness, Machine Learning (ML) algorithms (principal components and classification) are employed. The choice of ML algorithms is motivated by previous research [6], which found that ML algorithms have varying prediction accuracy for different fault-types. This paper validates ML algorithms w.r.t. RPs for each fault-type. We apply ML and principal components to best predict the capability of an individual inspector to report various fault-types (more details appear in Section III). We used an open-source ML tool (WEKA: Waikato Environment for Knowledge Analysis) for implementing the ML algorithms. This paper reports the results regarding the common RPs and the most effective ML algorithms that can assist project managers in selecting the most effective inspectors.

Figure 1. Sample Reading Pattern Showing Fixations and Scanpaths
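As a rough illustration of how per-page reading-pattern attributes of this kind can be aggregated from an eye-tracker's fixation log, the sketch below uses pandas on a hypothetical export; all column names and values are illustrative assumptions rather than the format produced by the equipment used in this study.

import pandas as pd

# Hypothetical export: one row per fixation, with the page it landed on, its
# duration (ms), and whether the following saccade moved forward along the
# text (a "linear" saccade). Column names are illustrative assumptions.
fixations = pd.DataFrame({
    "inspector": ["p01"] * 6,
    "page": [1, 1, 1, 2, 2, 2],
    "duration_ms": [210, 180, 250, 300, 190, 220],
    "linear_saccade": [True, True, False, True, False, True],
})

# Aggregate into per-page reading-pattern attributes of the kind used later
# (average fixation time, fixations per page, time spent per page, and the
# share of linear saccades per page).
per_page = (
    fixations.groupby(["inspector", "page"])
    .agg(avg_fixation_ms=("duration_ms", "mean"),
         fixations=("duration_ms", "size"),
         time_spent_ms=("duration_ms", "sum"),
         linear_saccade_pct=("linear_saccade", "mean"))
    .reset_index()
)
print(per_page)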
II. BACKGROUND
Eye movements were first analyzed using mirrors in Javal’s gaze motion research that was conducted in 1879. Javal noticed that people do not read in a linear fashion and instead they incorporate fixations and saccades [2], [7]. Researchers have conducted eye-tracking studies in domains like marketing research [8], evaluation of computer interfaces [9], common interactive tasks [10] and in game-based learning environments [11].
Researchers have also combined eye-tracking studies and ML techniques for prediction, based on an improved model-selection technique [12]. For example, ML applied to eye-tracking data was able to train models of saliency based on image features (low, medium, and high) that outperformed existing models. With respect to RPs, ML [12] was applied to the eye-tracking data of participants who were asked to read different document types (a novel, a manga, a fashion magazine, a newspaper, and a textbook) in different reading situations (an office, a coffee shop, a home setting, a library, and a lecture hall) to develop an automated classification system. The results of this study showed 74% accuracy in document recognition using user-independent training. This motivated our research, where we planned to use an ML approach to automate the selection of skilled inspectors based on their eye-movement data. ML applied to eye-tracking data (e.g., fixations, reading time) has also been used successfully to detect dyslexia [13]. The authors concluded that the RPs of a reader with dyslexia differ from those of a regular reader (i.e., more fixations, longer fixations, longer reading time). After training on a dataset of 1135 participants with and without dyslexia in a 10-fold cross-validation experiment, their method predicted dyslexia with an accuracy of 80.18%. Motivated by these results, Goswami et al. [14] conducted an eye-tracking experiment to investigate the RPs and LSs of inspectors to enhance inspection team performance. Their results found that certain RPs improved inspection outcomes, but these were not generalizable, and their calculation required certain metrics (e.g., ROIs, number of seeded faults) to be known prior to the inspection.
Inspired by these studies, the current research applies ML to the eye-tracking data (to find common RPs) of 39 participants, captured during an inspection task, to evaluate their ability to report various fault-types. Next, prominent classifiers (chosen based on their individual performance at fault classification) from 5 different classification families were compared (by manipulating various independent variables) at classifying each fault-type (discussed in Section I). More details regarding the experiment are presented in the next section.
III. EXPERIMENT DESIGN
A. Experiment Methodology
Research Questions (RQs): The following two research questions were investigated in this study:
RQ1: What reading patterns (RPs) of an inspector are most effective at reporting various fault-types?
This RQ focuses on various features that can predict the ability of an inspector to report potential fault-types. For this purpose, prominent RPs were evaluated through attribute selection techniques (e.g. information gain). More details appear later in this section.
RQ2: What type of ML algorithms can best predict inspectors’ effectiveness for various fault-types?
The focus of this RQ is to find the best-suited classifiers that could accurately predict the most effective inspector for each fault-type (details discussed below). This RQ analyzes different classifiers with the most informative features to predict the outcome using various evaluation metrics (i.e., precision, recall, false-positive rate, and F-measure). The visualization of results is presented with the receiver operating characteristic (ROC) curve.
To investigate the above RQs, the following independent and dependent variables were defined.
Independent variables: These variables were manipulated to measure their impact on one or more dependent variables; they are as follows:
a) Type of classifiers: The classifiers from 5 different classification families (i.e. Bayesian, Support Vector, Ensemble, Trees and Lazy Learners) were chosen based on their applicability and performance as reported by prior inspection studies in literature [6].
b) Fault types: The SRS document (Parking Garage Control System) evaluated in this study contained a total of 35 seeded faults, divided among 6 fault-categories: ambiguous (A), inconsistent information (II), incorrect fact (IF), omission (O), extraneous (E), and miscellaneous (M). The fault concentration per category out of the 35 seeded faults was A: 4, II: 12, IF: 2, O: 13, E: 3, and M: 1. The fault-types E and M were excluded from the study because they were detected by at most 2 inspectors, making these fault-types extremely imbalanced (i.e., an imbalanced distribution of instances in a binary classification problem [6]). Sampling the E and M fault-types did not produce good samples, so they were not included in the analysis.
c) Attribute Evaluators: The eye-tracking equipment recorded 21 RP attributes (see Section III.B) for each inspector. To select the prominent attributes (features or principal components) that could accurately characterize the RPs of inspectors for all fault-types, three algorithms were selected and evaluated (classifier subset evaluation, information gain, and the wrapper method) based on the literature [15], [16]. Table I provides the final list of fault-types, classifiers, and attribute evaluators. More details about the classifiers and attribute selection appear later in this section.
d) Training and Test Set: The eye-tracking study used the Parking Garage Control System (PGCS) SRS. PGCS was developed by Microsoft, contains 35 realistic faults seeded by the original authors, and is used by Microsoft to train their employees on inspections. The eye-tracking data used in this paper was generated from 39 participants (the majority of whom had around 2 years of SE experience). Post inspection, we labeled the final class attribute with binary class labels ('yes' or 'no', denoting whether an inspector is capable of reporting a fault or not). If an inspector reported a number of true faults greater than or equal to the mean number of faults found within a fault-type, that inspector was labeled 'yes' in the final class category. The final class label is required to evaluate the prediction results of the ML algorithms. For example, the mean number of faults found by each inspector for fault-type II was 1.4; any inspector reporting 2 or more true II faults is labeled capable of reporting fault-type II (i.e., has the value 'yes' in the final class attribute).
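The labeling rule can be illustrated with a short sketch; the fault counts below are hypothetical and pandas stands in for the manual labeling step described above.

import pandas as pd

# Hypothetical per-inspector counts of true faults reported for one fault-type
# (e.g., Inconsistent Information). Rule from the paper: label 'yes' if the
# count is >= the mean count across all inspectors for that fault-type.
faults_ii = pd.Series([0, 1, 2, 3, 1, 0, 4], name="true_II_faults")

threshold = faults_ii.mean()   # about 1.57 for this toy data; 1.4 in the study
labels = (faults_ii >= threshold).map({True: "yes", False: "no"})
print(threshold, list(labels))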
e) Validation method: Throughout our experiment, we used 10-fold cross-validation, as it is the most commonly used method to measure model performance. All the classifiers and ML approaches evaluated in this study were run with default parameters, unless specified otherwise.
Dependent Variables: The following variables were collected to measure the effect of the independent variables and acted as evaluation metrics.

a) Recall or true-positive rate (TP rate): the proportion of true positives that are correctly identified (i.e., sensitivity).

b) False-positive rate (FP rate): the ratio of the number of negative events wrongly categorized as positive to the total number of actual negative events.

c) Precision: the fraction of true positives among all instances retrieved as positive.

d) F-measure: a measure of a test's accuracy that considers both precision and recall.

e) ROC (Receiver Operating Characteristic): a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. It is created by plotting the true-positive rate against the false-positive rate. The observations and results have been derived from the F-measure and the ROC curve.
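To make the evaluation setup concrete, the sketch below computes these metrics under 10-fold cross-validation using scikit-learn; the data is synthetic and the library choice is an assumption for illustration (the study itself used WEKA).

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Hypothetical feature matrix (reading-pattern attributes) and binary labels
# (1 = 'yes', i.e., capable of reporting the fault-type; 0 = 'no').
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 15))
y = np.array([0] * 30 + [1] * 30)

# 10-fold cross-validation with the metrics used in the paper: recall
# (TP rate), precision, F-measure, and area under the ROC curve.
scores = cross_validate(
    RandomForestClassifier(random_state=0), X, y, cv=10,
    scoring=["recall", "precision", "f1", "roc_auc"],
)
for name in ["test_recall", "test_precision", "test_f1", "test_roc_auc"]:
    print(name, round(scores[name].mean(), 3))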
B. Experiment Procedure
The experiment procedure consists of the following six steps, as shown in Figure 2. Each step is described briefly in this section.

1. Inspection Data: The inspection data generated from the eye-tracking study [14] consisted of 21 different attributes (features). The SRS was divided into three sections (introduction, general description, and functional description), and the reading time was evaluated on these sections separately to better understand the impact of RPs. Due to space restrictions, only a brief description of the 21 attributes is presented in Table II.

Figure 2. Overall Experiment Procedure (Inspection Data, Preprocessing, Sampling, Principal Feature Extraction, Classification Approaches, Results)

TABLE I. VARIOUS INDEPENDENT VARIABLES TAKEN FOR THIS STUDY
Fault types | Classifiers | Attribute Evaluators
Type A, Type IF, Type II, Type O | NB, MNB, RF, Lazy Learner, Ensemble (AdaBoost, Voting, Bagging) | Classifier subset evaluator, Information Gain, Wrapper Subset Evaluation
2. Preprocessing: Some features, such as 'linear saccades per page' and 'time taken', contained suffixes (e.g., a % symbol, 'minutes') that required removal before processing. The class attribute 'actual faults' contained the total number of true faults found by an inspector; it was manually processed (as explained in Section III.A.d) by one of the authors to represent the final binary class label ('yes' or 'no') for all instances. The attributes id, total fixation at ROI, total time duration at ROI, total faults, efficiency, and false positives were removed from the analysis because this experiment aimed at providing automated and generalizable attributes for determining the capability of an inspector post-inspection. These six attributes are manually calculated post-inspection by one of the authors, so they do not contribute to the general adaptability of the features to other SRS documents where information about seeded faults (i.e., ROIs or regions with seeded faults) is not known. The final dataset had 15 features in total, including the final class attribute.
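A small sketch of this kind of cleanup, assuming a pandas data frame with hypothetical column names rather than the authors' actual export, is shown below.

import pandas as pd

# Hypothetical raw export: some attributes carry unit suffixes that must be
# stripped before they can be treated as numeric.
raw = pd.DataFrame({
    "id": [1, 2],
    "linear_saccades_per_page": ["62%", "71%"],
    "total_time_taken": ["38 minutes", "45 minutes"],
    "total_fixation_at_roi": [120, 98],   # computed post-inspection
    "actual_faults": [3, 1],
})

df = raw.copy()
df["linear_saccades_per_page"] = (
    df["linear_saccades_per_page"].str.rstrip("%").astype(float)
)
df["total_time_taken"] = (
    df["total_time_taken"].str.extract(r"(\d+)", expand=False).astype(float)
)

# Drop attributes that require knowledge of the seeded faults / ROIs, keeping
# only features that generalize to a new SRS (as described in step 2).
df = df.drop(columns=["id", "total_fixation_at_roi"])
print(df)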
3. Sampling: The data collected for the PGCS document contained an uneven number of true-positive and false-positive instances, exhibiting the class-imbalance problem. Therefore, a sampling technique, SMOTE (Synthetic Minority Oversampling Technique), was applied over the data with WEKA to artificially generate minority-class instances. The selection of the SMOTE technique is based on its prior performance on imbalanced data in the literature [17]. The data was then shuffled randomly (using the Randomize filter in WEKA) to select unbiased training and test sets during the validation stage of the experiment.
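A minimal sketch of this step, assuming the imbalanced-learn implementation of SMOTE instead of WEKA's filter and purely synthetic data, is shown below.

import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.utils import shuffle

# Hypothetical imbalanced data: far fewer 'yes' (capable) instances than 'no'.
rng = np.random.default_rng(1)
X = rng.normal(size=(39, 15))
y = np.array([1] * 8 + [0] * 31)   # 1 = 'yes', 0 = 'no'

# SMOTE synthesizes new minority-class instances by interpolating between a
# minority sample and its nearest minority neighbours; shuffling afterwards
# mirrors WEKA's Randomize filter.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
X_res, y_res = shuffle(X_res, y_res, random_state=0)
print(np.bincount(y), "->", np.bincount(y_res))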
4. Principal Feature Extraction: Three types of techniques were used to identify the best-performing features for the given data: 'classifier subset evaluation', 'information gain', and 'wrapper methods'. More details on these techniques can be found in [15], [16]. These techniques were selected based on their performance at ranking principal attributes over 14 well-known benchmark classification datasets, and these selectors are well suited to binary class problems.
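As an illustration of the two main styles of attribute evaluation used here, the sketch below ranks features by information gain (approximated with mutual information) and runs a wrapper-style forward selection around a random forest using scikit-learn; the data is synthetic and the library choice is an assumption, since the study used WEKA's evaluators.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif, SequentialFeatureSelector

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 15))
y = np.array([0, 1] * 30)

# Filter-style ranking: mutual information plays the role of WEKA's
# information-gain attribute evaluator (up to discretization details).
info_gain = mutual_info_classif(X, y, random_state=0)
ranked = np.argsort(info_gain)[::-1]
print("top features by information gain:", ranked[:5])

# Wrapper-style selection: grow a feature subset, scoring each candidate
# subset with the target classifier itself (here a random forest), analogous
# to a wrapper subset evaluator.
wrapper = SequentialFeatureSelector(
    RandomForestClassifier(random_state=0), n_features_to_select=5, cv=5
)
wrapper.fit(X, y)
print("wrapper-selected features:", np.flatnonzero(wrapper.get_support()))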
5. Classification Approaches: We applied classifiers from five different classification families (discussed earlier). These classifiers were naïve Bayes (NB), Multinomial NB (MNB), Decision Trees (DT), Random Forest (RF), Lazy learner (locally weighted learning), Stochastic Gradient Descent (SGD), and Ensemble methods (AdaBoost, Bagging, and Voting). This research evaluated the RQs using these classifiers and discusses the results for the classifiers that outperformed the others within their classification family.
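For readers who want to reproduce this setup outside WEKA, the following sketch lists rough scikit-learn counterparts of these classifier families; the mapping (e.g., k-NN standing in for the locally weighted lazy learner, and the ensembles wrapping RF) is an assumption for illustration, not the exact configuration used in the study.

from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rf = RandomForestClassifier(random_state=0)

# Approximate counterparts of the classifier families named above. Note that
# MultinomialNB expects non-negative, count-like features.
classifiers = {
    "NB": GaussianNB(),
    "MNB": MultinomialNB(),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": rf,
    "Lazy (k-NN)": KNeighborsClassifier(),
    "SGD": SGDClassifier(random_state=0),
    "AdaBoost+RF": AdaBoostClassifier(rf, random_state=0),
    "Bagging+RF": BaggingClassifier(rf, random_state=0),
    "Voting (RF, Bagging+RF, AdaBoost+RF)": VotingClassifier(
        estimators=[("rf", rf),
                    ("bag", BaggingClassifier(rf, random_state=0)),
                    ("ada", AdaBoostClassifier(rf, random_state=0))],
        voting="soft"),
}
print(list(classifiers))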
TABLE II. VARIOUS ATTRIBUTES EVALUATED IN THIS STUDY

Categories | Attributes (total 21)
General | id (# assigned as identifier)
Fixation data per page | Average fixation time, fixations per page, time spent per page, linear saccades per page
Fixation data at ROI (region of interest) | Total fixation at ROI, total time duration at ROI
# of times inspector went back to search an information | Total lookups in introduction, total lookups in general description, total lookups in functional, total # of searches
Time taken to search an information | Time spent on reading introduction, time spent on reading description, time spent on reading functional requirements, total search time
Inspection performance | Total faults in SRS, false-positives by inspector, faults reported by inspector, actual faults (effectiveness), total time taken, efficiency (fault rate)
6. Results: The performance evaluation was carried out by collecting the various metrics described in Section III.A. The results are reported w.r.t. F-measure and ROC, because these are among the standard benchmark metrics considered prominent for measuring classification performance.
The next section presents the results and discussion regarding this experiment.
IV. RESULTS AND DISCUSSION
This section presents the results on the best classification approaches and the prominent reading patterns that can predict an inspector's ability to report a specific fault-type. The experiment was executed using the WEKA (version 3.8.2) knowledge analysis tool to run the algorithms and analyze the results. The results were obtained by manipulating the independent variables (described in Section III.A) and measuring their impact on the dependent variables. Due to space restrictions, only the prominent results are shown in Table III for all fault-types. The area under the ROC curve is considered a prominent metric to evaluate classification performance based on the features selected by varying classifiers over the three attribute evaluation methods (see Table I). The best performing classifiers, prominent features, and evaluation results (percentage of AUC-ROC) for each fault-type are discussed below. The percentage of AUC-ROC is used as the performance evaluator for all fault-types. The results, organized around the two RQs (Section III.A), are as follows:
RQ1: Reading patterns versus inspectors’ ability to report actual fault-type
This research question is aimed at finding which RPs (collected during the eye-tracking study) can help determine an inspector's ability to find a specific fault-type. The experiment was evaluated using prominent features extracted in four ways to train a classifier: three attribute evaluator methods were used to extract features (see Table I for the attribute evaluators), and the fourth method considered all available features (a total of 15 features, as explained in Section III.B.2). The key findings (from Table III) are as follows:
• Most prominent feature set: The results show that a few prominent reading-pattern features (out of 15) were commonly ranked higher across all attribute evaluators used to predict inspection effectiveness for all fault-types. These features included average fixation time, linear saccades per page, fixations per page, time spent per page, and total lookups in functional. Using these subsets of features resulted in improved prediction accuracy.
• Other prominent features: In addition to the above features, metrics related to the time spent fixating or searching (lookups) on different parts of the SRS strengthened the prediction results. Specifically, average fixation time and total search time were the most informative features. This is an important result because companies rely on selecting inspectors that find faults faster to enable maximum cost savings. Evaluating reading patterns with respect to the time spent can help characterize inspectors' performance better, as demonstrated in this research.
• Out of the four evaluator methods used in this research, Wrapper Subset Evaluation resulted in the largest improvement in prediction accuracy. The percentage of AUC-ROC gain (shown in the last column of Table III) is a measure of the improvement in prediction accuracy for fault-types. Based on these gains, the accuracy at predicting inspectors is 81% for fault-type A, 94% for IF, 88% for II, and 79% for O. These prediction results are noteworthy, especially when compared against similar research in other domains (e.g., 80% in the dyslexia study and 74% in the document-recognition study described in Section II).
Implications: The most prominent features either belonged to inspectors' reading patterns (average fixation time, linear saccades per page, fixations per page, total lookups in functional) or to timing information (i.e., time spent per page, total search time). These features provide insights into inspectors' ability to comprehend, analyze, and detect problems in an SRS. These prominent features, when used by ML algorithms, predict inspectors' abilities to find different fault-types in an SRS.
Figure 3. ROC curve of Fault type-IF for Wrapper Subset Selection Method Using RF classifier
TABLE III. EXPERIMENT RESULTS FOR ALL CONSIDERED FAULT-TYPES
(Columns: Evaluator Method | Classification Type | Selected Features | TP rate | FP rate | Precision | Recall | F-measure | % AUC-ROC)

Fault-type A
None | RF | All | 80.8 | 34.6 | 70 | 80.8 | 75 | 75.5
Classifier Subset Evaluator | Ensemble (Bagging with RF) | Avg. fixation time, Linear saccades per page, Total lookups in general description, Total lookups in functional, Total # of searches, Total search time, Faults reported | 84.6 | 34.6 | 71 | 84.6 | 77.2 | 78.7
Classifier Subset Evaluator | Lazy Learner with RF | Same as above | 80.8 | 30.8 | 72.4 | 80.8 | 76.4 | 80.7
Information Gain | Lazy Learner with RF | Total time taken, Time spent per page, Fixation per page, Total lookups in intro, Linear saccade per page, Faults reported, Total lookups in general description | 80.8 | 30.8 | 72.4 | 80.8 | 76.4 | 80.7
Information Gain | Ensemble (AdaBoost with RF) | Same as above | 80.8 | 34.6 | 70 | 80.8 | 75 | 77.6
Wrapper Subset Evaluation | Ensemble (Bagging with RF) | Linear saccades per page, Total lookups in general description, Faults reported | 92.3 | 42.3 | 68.6 | 92.3 | 78.7 | 80.2
Wrapper Subset Evaluation | Lazy Learner with NB | Same as above | 80.8 | 38.5 | 67.7 | 80.8 | 73.7 | 77.4

Fault-type IF
None | RF | All | 46.7 | 3.2 | 93.3 | 46.7 | 62.2 | 79.4
Classifier Subset Evaluator | Ensemble (AdaBoost with RF) | Average fixation time, Total lookups in introduction | 70 | 9.7 | 87.5 | 70 | 77.8 | 83.4
Information Gain | RF | Total lookups in intro, Total lookups in functional, Time per page, Time reading intro, Avg. fixation time, Fixation per page | 83.3 | 6.5 | 92.6 | 83.3 | 87.7 | 92.6
Information Gain | Ensemble (Bagging with RF) | Same as above | 83.3 | 3.2 | 96.2 | 83.3 | 89.3 | 92.0
Information Gain | Voting (RF, Bagging with RF, and AdaBoost with RF) | Same as above | 66.7 | 3.2 | 95.2 | 66.7 | 78.4 | 91.9
Wrapper Subset Evaluation | Ensemble (AdaBoost with RF) | Average fixation time, Time spent per page, Total lookups in intro, Total lookups in functional, Time reading description, Total time taken | 83.3 | 3.2 | 96.2 | 83.3 | 89.3 | 94
Wrapper Subset Evaluation | Voting (RF, Bagging with RF, and AdaBoost with RF) | Same as above | 93.2 | 3.2 | 96 | 80 | 87.3 | 93.2
Wrapper Subset Evaluation | Lazy Learner with RF | Same as above | 86.7 | 6.5 | 92.9 | 86.7 | 89.7 | 93.5

Fault-type II
None | RF | All | 70.8 | 34.8 | 68 | 70.8 | 69.4 | 79.3
Classifier Subset Evaluator | Voting (RF, Bagging with RF, and AdaBoost with RF) | Average fixation time, Fixation per page, Time spent per page, Linear saccade per page, Total lookups in functional, Total # of searches, Time spent reading description, Total search time | 79.2 | 34.8 | 70.4 | 79.2 | 74.5 | 79.7
Classifier Subset Evaluator | Ensemble (AdaBoost with RF) | Same as above | 79.2 | 34.8 | 70.4 | 79.2 | 74.5 | 79.4
Information Gain | Ensemble (AdaBoost with RF) | Total lookups in functional, Time spent per page, Total search time, Fixation per page, Linear saccade per page, Time spent reading description, Total # of searches, Avg. fixation time | 79.2 | 34.8 | 70.4 | 79.2 | 74.5 | 79.4
Information Gain | Voting (RF, Bagging with RF, and AdaBoost with RF) | Same as above | 79.2 | 34.8 | 70.4 | 79.2 | 74.5 | 79.7
Wrapper Subset Evaluation | RF | Total lookups in functional, Time spent reading intro | 66.7 | 13 | 84.2 | 66.7 | 74.4 | 86.7
Wrapper Subset Evaluation | Voting (RF, Bagging with RF, and AdaBoost with RF) | Same as above | 70.8 | 13 | 85 | 70.8 | 77.3 | 85.5
Wrapper Subset Evaluation | Lazy Learner with RF | Same as above | 75 | 17.4 | 81.8 | 75 | 78.3 | 87.6

Fault-type O
None | RF | All | 69.6 | 29.2 | 69.6 | 69.6 | 69.6 | 69.7
Classifier Subset Evaluator | Lazy Learner with RF | Average fixation time, Linear saccade per page, Time spent reading intro | 52.2 | 45.8 | 52.2 | 52.2 | 52.2 | 55.9
Information Gain | Lazy Learner with RF | Total time taken, Linear saccade per page, Total lookups in intro, Time spent per page, Faults reported, Fixations per page | 73.9 | 37.5 | 65.4 | 73.9 | 69.4 | 66.3
Wrapper Subset Evaluation | Voting (RF, Bagging with RF, and AdaBoost with RF) | Average fixation time, Time spent per page, Total lookups in functional, Total search time, Total time taken | 78.3 | 16.7 | 81.8 | 78.3 | 80 | 79.3
Wrapper Subset Evaluation | Ensemble (AdaBoost with RF) | Same as above | 73.9 | 20.8 | 77.3 | 73.9 | 75.6 | 78.5
RQ2: ML algorithms versus inspectors' fault-reporting effectiveness for all fault-types
This question aimed at finding the most suitable classifiers that could predict the effectiveness of an inspector at reporting a fault-type. The results and discussion are based on the data presented in Table III. Due to space restrictions, the ROC curve is sketched (Figure 3) only for the most prominent fault-type (IF) and the most applicable evaluation method (i.e., the wrapper method using the RF classifier). Figure 3 shows the largest gain in the AUC-ROC metric for fault-type IF (94% when using the wrapper method vs. 80% without). Readers can refer to Table III, which shows accuracy gains across all classifiers (i.e., Random Forest, Lazy Learner, Naive Bayes, Voting, AdaBoost, and Bagging), all evaluation methods (Information Gain, Wrapper Subset, and Classifier Subset), and all fault-types. Some of the major observations from Table III are discussed below:
• The selection of the best performing inspectors is most accurately predicted by the Random Forest (RF) classifier when used to create an ensemble or voting method, for almost every fault-type. RF used with ensemble methods resulted in accuracies between 80% and 94%. We believe that companies can rely on these results to guide the selection of the most skilled inspectors.
• In terms of accuracy for fault-types, the largest prediction accuracy values were 94% for fault-type IF, 88% for II, 81% for A, and 79% for O. Mostly, this accuracy was obtained when the selected features related to RPs (i.e., fixations and saccades). Figure 3 shows the ROC curves of various classification types (AdaBoost, Voting, and Ensemble) using RF as the base classifier (Table III gives more detail on the performance of the classification types when used with various classifiers for all fault-types). The ROC curves in Figure 3 show that all classification types were able to make reliable predictions when used with the RF classifier (a larger area under the curve implies a more accurate detection probability).
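For readers reproducing this kind of comparison, the sketch below plots ROC curves for a random forest trained on all features versus a smaller feature subset, using scikit-learn on synthetic data; the "selected subset" is simply the first few columns and stands in for a wrapper-selected subset, so it is an illustrative assumption rather than the study's actual feature set.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import RocCurveDisplay
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the fault-type-IF instances.
rng = np.random.default_rng(3)
X = rng.normal(size=(120, 15))
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=1.0, size=120) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# One ROC curve per feature set, on the same axes for visual comparison.
ax = plt.gca()
for name, cols in [("all features", slice(None)), ("selected subset", slice(0, 4))]:
    clf = RandomForestClassifier(random_state=0).fit(X_tr[:, cols], y_tr)
    RocCurveDisplay.from_estimator(clf, X_te[:, cols], y_te, name=name, ax=ax)
plt.show()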
Implications: These results also show that Random Forest, either alone or when used with ensemble or voting methods, can strengthen the characteristics for selecting inspectors that would identify a larger number of faults. The RF classifier uses multiple decision trees to fit (i.e., train on) the data and classify test data. We believe that, because the majority of the eye-tracking features are continuous in nature, they are best split into intervals by the underlying decision trees. This resulted in a strongly learned RF classifier that outperformed the other classifiers on the test data. It was also observed that the prediction is accurate and generalizable when the features contain the RPs of an inspector. Inspectors that tend to fixate more, exhibit linear saccades, and perform more searches are more likely to identify more faults.
V. RELEVANCE OF RESULTS TO SOFTWARE COMPANIES
This research can help project managers understand the background factors of effective inspectors and identify skilled inspectors from the pool of inspectors available in their organization. The key finding of this research is that inspectors' RPs (e.g., fixations, saccades, information search) can help predict their eventual inspection effectiveness. Additionally, this research provides insights on how a small subset of prominent features, when used with an appropriate classification method (e.g., RF with ensembles), can result in strong predictions. To generalize these results, companies would need to collect a baseline set of data on inspectors' reading patterns to automate the selection of skilled inspectors for different fault-types. While there is some investment involved in collecting these data, it would provide managers with more objective information on how to plan and manage the inspection process. We do not claim this to be the final solution; it is a promising start, and we would need to collect data from different participants and in different settings to fully generalize the findings.
VI. CONCLUSION AND FUTURE SCOPE
This paper presented research on finding important and generalizable background factors of inspectors (specifically their RPs) that predominantly affect their fault-reporting effectiveness. It was found that inspectors' RPs (fixations, saccades, and timing data) were the most informative at predicting the most suitable inspectors for different fault-types. Ensemble methods (with RF) and the wrapper subset evaluation method yielded significant accuracy (up to 94%) in predicting the most effective inspectors. Our future work includes replicating this study with different inspectors of varying experience, with different requirements documents, and with additional classification families such as neural networks. We also plan to expand the number of attribute evaluation methods in future replications.
REFERENCES
[1] M. E. Fagan, "Advances in Software Inspections," IEEE Trans. Softw. Eng., vol. SE-12, no. 7, pp. 744–751, 1986.
[2] A. Goswami, G. S. Walia, and U. Rathod, "Using Learning Styles to Staff and Improve Software Inspection Team Performance," in Proc. 2016 IEEE 27th Int. Symp. Softw. Reliab. Eng. Workshops (ISSREW), 2016, pp. 9–12.
[3] J. C. Carver, N. Nagappan, and A. Page, "The impact of educational background on the effectiveness of requirements inspections: An empirical study," IEEE Trans. Softw. Eng., vol. 34, no. 6, pp. 800–812, 2008.
[4] A. Goswami, G. Walia, and A. Singh, "Using learning styles of software professionals to improve their inspection team performance," in Proc. Int. Conf. Softw. Eng. Knowl. Eng. (SEKE), 2015, pp. 680–685.
[5] A. Goswami and G. Walia, "An empirical study of the effect of learning styles on the faults found during the software requirements inspection," in Proc. 2013 IEEE 24th Int. Symp. Softw. Reliab. Eng. (ISSRE), 2013, pp. 330–339.
[6] M. Singh, V. Anu, G. S. Walia, and A. Goswami, "Validating Requirements Reviews by Introducing Fault-Type Level Granularity," in Proc. 11th Innovations in Software Engineering Conference (ISEC '18), 2018, pp. 1–11.
[7] M. Just and P. Carpenter, "A theory of reading: From eye fixations to comprehension," Psychol. Rev., vol. 87, no. 4, pp. 329–354, 1980.
[8] P. Chandon, J. W. Hutchinson, E. T. Bradlow, and S. H. Young, "Measuring the value of point-of-purchase marketing with commercial eye-tracking data," Soc. Sci. Res. Netw., p. 46, 2001.
[9] J. H. Goldberg and X. P. Kotval, "Computer interface evaluation using eye movements: Methods and constructs," Int. J. Ind. Ergon., vol. 24, no. 6, pp. 631–645, 1999.
[10] S. Zhai, "What's in the eyes for attentive input," Commun. ACM, vol. 46, no. 3, p. 34, 2003.
[11] T. J. Mehigan and I. Pitt, "Detecting Learning Style through Biometric Technology for Mobile GBL," Int. J. Game-Based Learn., vol. 2, no. 2, pp. 55–74, 2012.
[12] K. Kunze, Y. Utsumi, Y. Shiga, K. Kise, and A. Bulling, "I know what you are reading: Recognition of document types using mobile eye tracking," in Proc. UbiComp, 2013, pp. 113–116.
[13] L. Rello and M. Ballesteros, "Detecting readers with dyslexia using machine learning with eye tracking measures," in Proc. 12th Web for All Conference (W4A '15), 2015, pp. 1–8.
[14] A. Goswami, G. Walia, M. McCourt, and G. Padmanabhan, "Using Eye Tracking to Investigate Reading Patterns and Learning Styles of Software Requirement Inspectors to Enhance Inspection Team Outcome," in Proc. 10th ACM/IEEE Int. Symp. Empirical Softw. Eng. Meas. (ESEM '16), 2016, pp. 1–10.
[15] M. A. Hall, "Correlation-based Feature Selection for Machine Learning," The University of Waikato, 1998.
[16] M. A. Hall and G. Holmes, "Benchmarking attribute selection techniques for data mining," IEEE Trans. Knowl. Data Eng., vol. 15, no. 6, pp. 1437–1447, 2003.
[17] M. Singh, G. S. Walia, and A. Goswami, "An Empirical Investigation to Overcome Class-imbalance in Inspection Reviews," in Proc. 2017 International Conference on Machine Learning and Data Science (MLDS), 2017, pp. 128–135.