Performance discrepancy mitigation in heart disease prediction for multisensory inter-datasets

18 March 2024 | Mahmudul Hasan, Md Abdus Sahid, Md Palash Uddin, Md Abu Marjan, Seifedine Kadry, Jung-eun Kim
This study addresses the inter-dataset discrepancy problem in heart disease prediction by analyzing five available heart disease datasets and their combined form. The researchers applied a comprehensive preprocessing pipeline: handling missing values, log transformation, outlier removal, normalization, data balancing with SMOTE-Tomek, feature selection with random forest (RF), and dimensionality reduction with principal component analysis (PCA). The goal was to mitigate the performance discrepancy between datasets and improve the accuracy of machine learning (ML) models for heart disease prediction. The study evaluated eight ML classifiers: support vector machine (SVM), K-nearest neighbors (KNN), decision tree (DT), RF, extreme gradient boosting (XGBoost), Gaussian naive Bayes (GNB), logistic regression (LR), and multilayer perceptron (MLP). Feature selection using RF produced better results than the other combination strategies in both single- and inter-dataset setups. In certain configurations, RF achieved 100% accuracy during the feature selection phase in an inter-dataset setup, with correspondingly high precision, recall, F1 score, specificity, and AUC score. The study highlights that preprocessing can improve ML model performance without requiring complex prediction models. It also opens a research avenue by addressing inter-dataset discrepancies, which enables features from multiple datasets to be integrated into a comprehensive global dataset within a specific domain.
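To make the early preprocessing steps concrete, here is a minimal sketch of missing-value handling, log transformation, outlier removal, and normalization for a single numeric feature. It is not the study's implementation. Median imputation, `log(1 + x)`, the 1.5 × IQR fence, and min-max scaling are assumed choices, the function names are hypothetical, and the cholesterol-style values are made up for illustration. The balancing and feature-selection steps are left out; in practice they are usually done with libraries such as imbalanced-learn (`SMOTETomek`) and scikit-learn (`RandomForestClassifier`).

```python
import math
from statistics import median, quantiles


def impute_median(column):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [x for x in column if x is not None]
    fill = median(observed)
    return [fill if x is None else x for x in column]


def iqr_bounds(column, k=1.5):
    """Return (low, high) fences placed k * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(column, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def min_max_normalize(column):
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def preprocess_feature(values, labels, k=1.5):
    """Impute, log-transform, drop IQR outliers, then min-max normalize.

    Labels are filtered together with the values so rows stay aligned.
    Assumes non-negative inputs, since log1p is undefined at or below -1.
    """
    filled = impute_median(values)
    logged = [math.log1p(v) for v in filled]
    low, high = iqr_bounds(logged, k)
    kept = [(v, y) for v, y in zip(logged, labels) if low <= v <= high]
    kept_values = [v for v, _ in kept]
    kept_labels = [y for _, y in kept]
    return min_max_normalize(kept_values), kept_labels


# Made-up readings: one missing value and one implausibly large value
# that exercises the outlier step.
cholesterol = [200, 210, None, 190, 205, 220, 195, 1500]
target = [0, 1, 0, 0, 1, 1, 0, 1]

normalized, kept_target = preprocess_feature(cholesterol, target)
print(len(normalized), kept_target)
```

The steps follow the order in which the summary lists them. Whether outliers should be removed before or after the log transform is a design choice that depends on how skewed each feature is, and the summary does not settle it.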