Toward reliable diabetes prediction: Innovations in data engineering and machine learning applications

2024 | Md. Alamin Talukder, Md. Manowarul Islam, Md Ashraf Uddin, Mohsin Kazi, Majdi Khalid, Arnisha Akhter, and Mohammad Ali Moni
This paper presents an optimized data preprocessing pipeline and machine learning (ML) models to improve the accuracy of diabetes prediction. The study addresses class imbalance with random oversampling and overfitting with k-fold cross-validation. Four diabetes datasets were used to evaluate the proposed method, and several ML algorithms were compared, including decision trees (DT), Naive Bayes (NB), k-nearest neighbors (KNN), logistic regression (LR), extreme gradient boosting (XGBoost), and support vector machine (SVM). The results show that the proposed method significantly improves accuracy, precision, recall, and F1-score compared to existing methods. Specifically, random forest (RF) achieved an accuracy of 86% on Dataset 1 and 98.48% on Dataset 2, while XGBoost achieved 99.27% on Dataset 3 and DT achieved 100% on Dataset 4. The study concludes that the proposed models can enhance diabetes prediction accuracy, contributing to better preventative interventions and reducing the incidence and costs associated with diabetes.
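
The two techniques named in the abstract, random oversampling and k-fold cross-validation, can be illustrated with a minimal standard-library sketch. This is not the authors' pipeline: the toy data are synthetic, the 1-nearest-neighbor classifier is a stand-in, and the helper names (`random_oversample`, `k_fold_indices`, `cross_validate`) are hypothetical. The sketch also assumes that oversampling is applied only to the training part of each fold so that duplicated rows cannot leak into the held-out fold; the abstract does not state where in the pipeline oversampling happens.

```python
import random
from collections import Counter

def random_oversample(X, y, rng):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lbl in zip(X, y) if lbl == label]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out

def k_fold_indices(n, k, rng):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def predict_1nn(X_train, y_train, x):
    """Toy stand-in classifier: label of the closest training row (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in X_train]
    return y_train[dists.index(min(dists))]

def cross_validate(X, y, k=5, seed=0):
    """Return per-fold accuracy, oversampling only the training part of each fold (assumption)."""
    rng = random.Random(seed)
    folds = k_fold_indices(len(X), k, rng)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        X_tr, y_tr = random_oversample(
            [X[j] for j in train_idx], [y[j] for j in train_idx], rng
        )
        hits = sum(predict_1nn(X_tr, y_tr, X[j]) == y[j] for j in test_idx)
        scores.append(hits / len(test_idx))
    return scores

# Synthetic, imbalanced toy data (not from the paper): 40 negatives, 10 positives.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(40)] + \
    [[data_rng.gauss(3, 1), data_rng.gauss(3, 1)] for _ in range(10)]
y = [0] * 40 + [1] * 10

scores = cross_validate(X, y, k=5)
print("fold accuracies:", scores)
print("mean accuracy:", sum(scores) / len(scores))
```

In practice, library implementations such as scikit-learn's stratified k-fold splitter and imbalanced-learn's random oversampler would likely replace these hand-written helpers; the plain-Python version is used here only to keep the mechanics visible.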