Toward reliable diabetes prediction: Innovations in data engineering and machine learning applications

2024 | Md. Alamin Talukder, Md. Manowarul Islam, Md Ashraf Uddin, Mohsin Kazi, Majdi Khalid, Arnisha Akhter and Mohammad Ali Moni
This paper presents an approach to diabetes prediction that combines a data preprocessing pipeline with several machine learning algorithms, with the aim of improving the accuracy of diabetes diagnosis. It addresses class imbalance through random oversampling and uses k-fold cross-validation to evaluate the models. The method is evaluated on four diabetes datasets, each with its own set of attributes.

The authors report that the approach improves accuracy over existing methods. The random forest algorithm achieved an accuracy of 86% on Dataset 1 and 98.48% on Dataset 2, while extreme gradient boosting achieved 99.27% on Dataset 3 and the decision tree achieved 100% on Dataset 4. Compared with models trained without preprocessing, the proposed method improved accuracy by 12.15%.

The paper highlights the role of preprocessing in improving dataset quality and model performance. Missing-value handling, outlier removal, label encoding, and data normalization prepare the data for training. Random oversampling balances the classes to reduce bias toward the majority class, and k-fold cross-validation is used to guard against overfitting and to give a more reliable estimate of performance. The authors report that the method outperforms existing approaches on accuracy, precision, recall, and F1-score across the four datasets.
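To make the balancing and evaluation steps concrete, here is a minimal sketch in plain Python of random oversampling, k-fold splitting, and the four reported metrics. It is not the authors' code. The function names, the toy data, and the choice of k = 5 are assumptions made for illustration. The summary also does not say whether oversampling was applied before or inside cross-validation; this sketch balances the training folds only, a common precaution so that duplicated rows cannot appear in both the training and test data.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs so that each sample lands in
    exactly one test fold."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy data (made up for illustration): one feature, 8 negatives, 2 positives.
X = [[v] for v in (1, 2, 3, 4, 5, 6, 7, 8, 20, 22)]
y = [0] * 8 + [1] * 2

for train_idx, test_idx in k_fold_indices(len(X), k=5):
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    # Balance the training fold only; the test fold keeps its real distribution.
    X_bal, y_bal = random_oversample(X_train, y_train)
    print(len(test_idx), dict(Counter(y_bal)))
```

In practice, a classifier would be fitted on `X_bal`, `y_bal` in each fold and scored on the untouched test fold with `binary_metrics`, with the per-fold scores averaged. Libraries such as scikit-learn and imbalanced-learn provide tested equivalents of these helpers.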
The paper contributes to the field of diabetes research by providing an optimized data preprocessing pipeline, addressing dataset imbalance, and demonstrating superior performance through extensive experimentation. The findings indicate that the proposed models can enhance the accuracy of diabetes predictions, supporting current preventative interventions to reduce the incidence of diabetes and its associated costs.