18 March 2024 | Mahmudul Hasan, Md Abdus Sahid, Md Palash Uddin, Md Abu Marjan, Seifedine Kadry, Jungeun Kim
This paper addresses the inter-dataset performance discrepancy in heart disease prediction, a critical issue where models trained on one dataset perform poorly on another. The authors propose a comprehensive preprocessing pipeline and evaluate various machine learning (ML) classifiers to mitigate this discrepancy. The study uses five heart disease datasets: Cleveland, Hungarian, Switzerland, Long Beach VA, and StatLog, along with their combined form. The preprocessing pipeline includes missing-value imputation with random forest (RF) regression, log transformation, outlier detection, normalization, and imbalanced-data handling with SMOTETomek. Feature reduction is achieved through principal component analysis (PCA), and feature selection is performed with RF. The performance of several ML classifiers is evaluated: support vector machine (SVM), K-nearest neighbors (KNN), decision tree (DT), RF, eXtreme Gradient Boosting (XGBoost), Gaussian naive Bayes (GNB), logistic regression (LR), and multilayer perceptron (MLP). The results show that RF outperforms the other classifiers in both the single-dataset and inter-dataset setups, reaching 100% and 96% accuracy in certain configurations. The study highlights the importance of effective preprocessing in improving ML model performance without resorting to complex prediction models.
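The inter-dataset setup and RF-based feature selection can be sketched as below: fit on one dataset, evaluate on another. The two synthetic datasets stand in for the real ones (e.g. Cleveland as training, Hungarian as test); the median importance threshold is an assumed choice, not taken from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score

def make_dataset(seed, n=300):
    """Synthetic stand-in: 8 features, only the first two carry signal."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 8))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)
    return X, y

X_train, y_train = make_dataset(seed=1)  # "source" dataset for training
X_test, y_test = make_dataset(seed=2)    # different "target" dataset for testing

# RF feature selection: keep features at or above median importance
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold="median").fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# Train RF on the selected features of one dataset, test on the other
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train_sel, y_train)
acc = accuracy_score(y_test, clf.predict(X_test_sel))
print(f"inter-dataset accuracy: {acc:.2f}")
```

In the paper's PCA variant, `SelectFromModel` would be replaced by a PCA projection fitted on the training dataset and applied unchanged to the test dataset.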