Impact of Feature Selection Methods on the Predictive Performance of Software Defect Prediction Models: An Extensive Empirical Study

Open Access

8 July 2020

journal article
research article
Published by MDPI AG in Symmetry

Vol. 12 (7), 1147
https://doi.org/10.3390/sym12071147

Abstract

Feature selection (FS) is a feasible solution for mitigating the high-dimensionality problem, and many FS methods have been proposed in the context of software defect prediction (SDP). However, empirical studies on the impact and effectiveness of FS methods on SDP models often produce contradictory experimental results and inconsistent findings. These contradictions can be attributed to study limitations such as small datasets, limited FS search methods, and unsuitable prediction models within the scope of the respective studies. It is therefore critical to conduct an extensive empirical study that addresses these contradictions, guides researchers, and strengthens the scientific rigor of experimental conclusions. In this study, we investigated the impact of 46 FS methods using Naïve Bayes and Decision Tree classifiers over 25 software defect datasets from 4 software repositories (NASA, PROMISE, ReLink, and AEEEM). The resulting prediction models were evaluated based on accuracy and AUC values. The Scott–KnottESD and the novel Double Scott–KnottESD rank statistical methods were used to rank the studied FS methods statistically. The experimental results showed that there is no single best FS method, as the performance of each method depends on the choice of classifier, performance evaluation metric, and dataset. However, we recommend the use of statistical-based, probability-based, and classifier-based filter feature ranking (FFR) methods in SDP. For filter subset selection (FSS) methods, correlation-based feature selection (CFS) with metaheuristic search methods is recommended. For wrapper feature selection (WFS) methods, the IWSS-based WFS method is recommended, as it outperforms the conventional SFS and LHS-based WFS methods.
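
To make the idea of filter feature ranking and AUC-based evaluation concrete, the following is a minimal sketch in plain Python. It is not the pipeline used in the study: the ranking criterion (absolute Pearson correlation with the defect label), the helper names `pearson`, `rank_features`, and `auc`, and the six-module toy dataset are all assumptions chosen for illustration, and no classifier is trained here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def rank_features(X, y):
    """Return feature indices sorted by |correlation with the label|, best first."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 6 modules, 3 metrics each; label 1 = defective.
X = [
    [1, 5, 0],
    [2, 3, 1],
    [3, 6, 0],
    [7, 4, 1],
    [8, 5, 0],
    [9, 3, 1],
]
y = [0, 0, 0, 1, 1, 1]

ranking = rank_features(X, y)
top_feature_auc = auc([row[ranking[0]] for row in X], y)
```

In this sketch each feature is scored independently of any classifier, which is what distinguishes a filter ranking method from a wrapper method. A fuller experiment along the lines of the abstract would keep the top-ranked features, train a classifier such as Naïve Bayes or a decision tree on them, and compute accuracy and AUC from the classifier's predictions.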

Funding Information

Universiti Teknologi Petronas (YUTP-FRG/015LC0240)
