Feature selection strategies: a comparative analysis of SHAP-value and importance-based methods

2024 | Huanjing Wang, Qianxin Liang, John T. Hancock, and Taghi M. Khoshgoftaar
This study compares two feature selection methods for credit card fraud detection: SHAP-value-based ranking and each model's built-in feature importance ranking. The methods are evaluated with five classifiers, XGBoost, Decision Tree, CatBoost, Extremely Randomized Trees, and Random Forest, using the Area under the Precision-Recall Curve (AUPRC) as the evaluation metric. Experiments were conducted on the Kaggle Credit Card Fraud Detection Dataset, which contains 284,807 transactions, of which only 492 (0.172%) are labeled fraudulent. Both methods rank the features, and the top-ranked subsets are then used to train and assess the models. The results show that for smaller feature subset sizes, such as 3 and 10, importance-based selection outperforms SHAP-based selection across all five classifiers; for larger subset sizes, the difference between the two methods is not significant. For models trained on larger datasets, the authors therefore recommend the model's built-in feature importance list as the primary feature selection method: computing SHAP importance is a separate, additional step, whereas built-in importance is produced as a by-product of training at no extra cost. The study concludes that while SHAP remains valuable for explaining model predictions, the built-in feature importance list is the more efficient and practical choice for large-scale applications and more complex models.
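To make the comparison concrete, below is a minimal sketch (not the authors' code) of the two ranking strategies on the Kaggle dataset, shown for XGBoost only. It assumes scikit-learn, xgboost, and shap are installed, that the data file is named `creditcard.csv` with the fraud label in the `Class` column (as in the Kaggle dataset), and uses `average_precision_score` as a stand-in for AUPRC; the hyperparameters are illustrative.

```python
# Sketch: rank features by built-in importance vs. mean |SHAP|, then
# retrain on the top-k features from each ranking and compare AUPRC.
import numpy as np
import pandas as pd
import shap
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score  # stand-in for AUPRC
from xgboost import XGBClassifier

df = pd.read_csv("creditcard.csv")  # Kaggle data: 284,807 rows, 492 frauds
X, y = df.drop(columns=["Class"]), df["Class"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = XGBClassifier(n_estimators=200, eval_metric="aucpr")
model.fit(X_tr, y_tr)

# Ranking 1: built-in importance, a free by-product of training.
imp_rank = X.columns[np.argsort(model.feature_importances_)[::-1]]

# Ranking 2: SHAP importance, a separate post-hoc computation.
explainer = shap.TreeExplainer(model)
shap_vals = explainer.shap_values(X_tr)
shap_rank = X.columns[np.argsort(np.abs(shap_vals).mean(axis=0))[::-1]]

# Retrain on the top-k features from each ranking and compare AUPRC,
# using the subset sizes (3 and 10) highlighted in the study.
for name, rank in [("built-in", imp_rank), ("SHAP", shap_rank)]:
    for k in (3, 10):
        cols = list(rank[:k])
        m = XGBClassifier(n_estimators=200, eval_metric="aucpr")
        m.fit(X_tr[cols], y_tr)
        score = average_precision_score(y_te, m.predict_proba(X_te[cols])[:, 1])
        print(f"{name:8s} top-{k:2d}: AUPRC = {score:.4f}")
```

The sketch also illustrates the efficiency argument: the built-in ranking falls out of the single `fit` call, while the SHAP ranking requires an extra explainer pass over the training data.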