Feature selection strategies: a comparative analysis of SHAP-value and importance-based methods

Huanjing Wang, Qianxin Liang, John T. Hancock, Taghi M. Khoshgoftaar. Journal of Big Data (2024) 11:44
This study compares the effectiveness of feature selection techniques based on SHAP (SHapley Additive exPlanations) values with those based on model-supplied feature importance, in the context of credit card fraud detection. The research uses the Kaggle Credit Card Fraud Detection Dataset, which contains 284,807 transactions described by 30 features; only 492 of the transactions are fraudulent. Both feature selection methods, SHAP-based and importance-based, are applied to five classifiers: XGBoost, Decision Tree, CatBoost, Extremely Randomized Trees, and Random Forest. Model performance is evaluated using the Area Under the Precision-Recall Curve (AUPRC). The results indicate that feature selection based on importance values generally outperforms selection based on SHAP values across all classifiers and feature subset sizes. Accordingly, for large datasets and complex models, the study recommends using the model's built-in feature importance list for its efficiency and practicality. The study also highlights the computational complexity of SHAP and suggests that it may be less practical for large datasets than built-in feature importance methods. The findings provide useful guidance for researchers and practitioners in credit card fraud detection and machine learning.
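A minimal sketch of the two ranking strategies follows, assuming the Kaggle CSV is saved locally as "creditcard.csv" (the file name, model settings, and subset sizes are illustrative, not the authors' exact experimental design). It trains an XGBoost model, ranks features once by the model's built-in importances and once by mean absolute SHAP value, then retrains on the top-k features of each ranking and reports AUPRC:

    import numpy as np
    import pandas as pd
    import shap
    import xgboost as xgb
    from sklearn.metrics import average_precision_score  # AUPRC
    from sklearn.model_selection import train_test_split

    # Load the Kaggle Credit Card Fraud Detection Dataset (file name assumed).
    df = pd.read_csv("creditcard.csv")
    X, y = df.drop(columns=["Class"]), df["Class"]
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    model = xgb.XGBClassifier(n_estimators=200, eval_metric="aucpr")
    model.fit(X_tr, y_tr)

    # Importance-based ranking: the classifier's built-in importance scores.
    imp_rank = np.argsort(model.feature_importances_)[::-1]

    # SHAP-based ranking: mean absolute SHAP value per feature,
    # computed on the training split with the standard TreeExplainer.
    shap_values = shap.TreeExplainer(model).shap_values(X_tr)
    shap_rank = np.argsort(np.abs(shap_values).mean(axis=0))[::-1]

    def auprc_top_k(rank, k):
        # Retrain on the k highest-ranked features, score AUPRC on the test set.
        cols = X.columns[rank[:k]]
        m = xgb.XGBClassifier(n_estimators=200, eval_metric="aucpr")
        m.fit(X_tr[cols], y_tr)
        return average_precision_score(y_te, m.predict_proba(X_te[cols])[:, 1])

    for k in (5, 10, 15):
        print(f"k={k}  importance={auprc_top_k(imp_rank, k):.4f}  "
              f"shap={auprc_top_k(shap_rank, k):.4f}")

On the full 284,807-row dataset the SHAP computation is the bottleneck, which illustrates the computational-cost point the study makes; the built-in importance ranking comes essentially for free once the model is trained.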