2024 | John T. Hancock¹, Huanjing Wang², Taghi M. Khoshgoftaar¹ and Qianxin Liang¹
This study investigates the effectiveness of data reduction techniques for handling highly imbalanced Medicare Big Data in fraud detection. It evaluates Random Undersampling (RUS) and an ensemble supervised feature selection method, applied alone and in combination, and measures their effect on the performance of Machine Learning models. Two datasets from the Centers for Medicare & Medicaid Services (CMS), labeled using the List of Excluded Individuals/Entities (LEIE), serve as the basis for the experiments. The results show that combining RUS and feature selection significantly improves classification performance, particularly in terms of the Area Under the Precision-Recall Curve (AUPRC), and that reducing the number of features leads to more interpretable models. The study compares its results with previous research and highlights the importance of threshold-agnostic metrics such as AUPRC for evaluating classifiers on imbalanced data. It also emphasizes the value of using a variety of classifiers and feature selection methods. Medicare fraud is a critical issue with significant financial implications, so more effective detection systems can have significant implications for healthcare services.
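To make the RUS step concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The function name `random_undersample`, the `ratio` parameter, and the toy data are assumptions introduced here for illustration; the summary does not state which class ratios or tooling the study used.

```python
import random
from collections import Counter


def random_undersample(rows, labels, ratio=1.0, seed=0):
    """Keep every minority (label 1) row and a random subset of majority
    (label 0) rows so that majority:minority is at most ratio:1.

    `ratio` and `seed` are illustrative parameters, not values from the study.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    # Never ask for more majority rows than actually exist.
    n_keep = min(len(majority), int(round(ratio * len(minority))))
    kept = minority + rng.sample(majority, n_keep)  # sampling without replacement
    rng.shuffle(kept)
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy data (hypothetical): 1000 rows, of which 10 are labeled fraudulent.
labels = [1 if i % 100 == 0 else 0 for i in range(1000)]
rows = [[i, i * 2] for i in range(1000)]

X_rus, y_rus = random_undersample(rows, labels, ratio=1.0, seed=42)
print(sorted(Counter(y_rus).items()))
```

Undersampling of this kind would normally be applied to the training data only, so that evaluation still reflects the original class imbalance.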
The research provides a detailed methodology for combining data reduction techniques and highlights the importance of statistical analysis in evaluating the performance of these techniques. The study also discusses the challenges of handling imbalanced data and high dimensionality in healthcare insurance fraud detection, and proposes data reduction techniques as a solution. The study also highlights the importance of using a variety of data sources and classifiers to improve the accuracy and effectiveness of fraud detection systems. The research concludes that data reduction techniques can significantly improve the performance of Machine Learning models in detecting Medicare fraud, and that these techniques should be considered as a key component of fraud detection systems.
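The threshold-agnostic AUPRC metric mentioned in this summary can be illustrated with a short sketch. The code below computes average precision, one common estimator of the area under the precision-recall curve; whether the study used this estimator or another one is not stated here, so treat it as an assumption. The function name `average_precision` and the example scores are hypothetical, and tied scores are not given special handling.

```python
def average_precision(y_true, scores):
    """Average precision: the mean of the precision values measured at the
    rank of each positive example, with examples sorted by descending score.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(y_true)
    if n_pos == 0:
        return 0.0  # undefined without positives; 0.0 is a convention chosen here
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


# Hypothetical classifier scores for four providers, two of them fraudulent.
y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(average_precision(y_true, scores))
```

Because the metric depends only on how the scores rank the examples, no decision threshold has to be chosen, which is what makes it suitable for comparing classifiers on highly imbalanced data.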