2024 | Sheng Zhou, Chujiao Hu, Shanshan Wei, and Xiaofan Yan
This study presents a breast cancer classification system based on multiple machine learning algorithms, aiming to improve diagnostic accuracy and efficiency. The research uses the Wisconsin Breast Cancer Dataset, which contains 569 samples with 32 attributes each: a sample ID, one classification label (benign or malignant), and 30 physiological characteristics. The dataset was preprocessed to handle missing values and normalize the data, and features were then ranked by their Spearman correlation coefficient with the target variable. The 10 most strongly correlated features were kept as the subset for further analysis.
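The preprocessing and feature-selection step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes scikit-learn's bundled copy of the dataset (which already omits the ID column and has no missing values), min-max scaling as the normalization, and ranking by absolute Spearman correlation. The helper name `select_top_features` is hypothetical.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler


def select_top_features(X, y, names, k=10):
    """Rank features by |Spearman rho| against the label and keep the top k."""
    scores = []
    for j in range(X.shape[1]):
        rho, _ = spearmanr(X[:, j], y)
        scores.append(abs(rho))
    scores = np.array(scores)
    order = np.argsort(scores)[::-1][:k]  # indices of the k largest scores
    return order, [str(names[j]) for j in order], scores[order]


data = load_breast_cancer()  # 569 samples, 30 numeric features
# Assumption: min-max scaling; the summary only says the data were normalized.
X = MinMaxScaler().fit_transform(data.data)
idx, top_names, top_scores = select_top_features(X, data.target, data.feature_names)
X_subset = X[:, idx]  # the 10-feature subset

for name, score in zip(top_names, top_scores):
    print(f"{name:25s} |rho| = {score:.3f}")
```

Spearman correlation is rank-based, so the scaling does not change the ranking; it is applied here only to mirror the order of steps described above.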
The study evaluates seven machine learning algorithms: decision tree, stochastic gradient descent (SGD), random forest, support vector machine (SVM), logistic regression, AdaBoost, and AdaBoost-Logistic. The AdaBoost-Logistic algorithm achieved the highest accuracy, 99.12%, outperforming the other six. Performance was assessed using accuracy, precision, recall, F1 score, and confusion matrices. The AdaBoost-Logistic model classified both benign and malignant cases well.
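One plausible reading of AdaBoost-Logistic is AdaBoost with logistic regression as the base learner; the sketch below takes that reading as an assumption. The 80/20 stratified split, the hyperparameters, and the use of all 30 features are also illustrative choices, so the numbers it prints are not expected to reproduce the reported 99.12%.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = data.data
y = (data.target == 0).astype(int)  # relabel so that 1 = malignant, 0 = benign

# Assumption: 80/20 stratified split with a fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# The base learner is passed positionally because its keyword name
# differs between scikit-learn versions.
model = make_pipeline(
    StandardScaler(),  # fitted on the training split only
    AdaBoostClassifier(LogisticRegression(max_iter=1000),
                       n_estimators=50, random_state=0),
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
print(cm)
```

Treating malignant as the positive class means recall measures the share of malignant cases that the model catches, which is usually the quantity of clinical interest.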
The study also investigates the statistical significance of the feature subset using the Wilcoxon rank sum test, which revealed significant differences in the distribution of the feature data across breast cancer categories. The results indicate that the feature subset data are strongly correlated with the target variable, making them suitable for classification tasks.
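A per-feature rank sum test of this kind can be sketched with SciPy's `ranksums`. The example assumes scikit-learn's copy of the dataset and tests all 30 features rather than the selected subset; the helper name `rank_sum_by_class` is hypothetical.

```python
from scipy.stats import ranksums
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target  # in scikit-learn's copy, 0 = malignant, 1 = benign


def rank_sum_by_class(X, y, names):
    """Wilcoxon rank sum test of malignant vs. benign values for each feature."""
    results = {}
    for j, name in enumerate(names):
        stat, p_value = ranksums(X[y == 0, j], X[y == 1, j])
        results[str(name)] = (stat, p_value)
    return results


results = rank_sum_by_class(X, y, data.feature_names)
for name in ("mean radius", "worst concave points"):
    stat, p_value = results[name]
    print(f"{name}: statistic = {stat:.2f}, p = {p_value:.3g}")
```

A small p-value indicates that the feature's values are distributed differently in the two classes. A positive statistic here means the malignant samples tend to have larger values.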
The AdaBoost-Logistic algorithm was further validated using receiver operating characteristic (ROC) and precision-recall curves, which showed high accuracy and AUC values. The model's performance was also visualized using 2D and 3D scatter plots, demonstrating its effectiveness in classifying breast cancer cases.
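The curve-based validation can be sketched as below. As before, this assumes that AdaBoost-Logistic means AdaBoost over a logistic regression base learner, and the split and hyperparameters are illustrative rather than the authors' settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
y = (data.target == 0).astype(int)  # 1 = malignant, 0 = benign
X_train, X_test, y_train, y_test = train_test_split(
    data.data, y, test_size=0.2, stratify=y, random_state=0)

model = make_pipeline(
    StandardScaler(),
    AdaBoostClassifier(LogisticRegression(max_iter=1000),
                       n_estimators=50, random_state=0),
)
model.fit(X_train, y_train)

# Both curves are built from scores, not from hard class predictions.
scores = model.predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, scores)
precision, recall, _ = precision_recall_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)
ap = average_precision_score(y_test, scores)

print(f"ROC AUC: {auc:.4f}")
print(f"Average precision: {ap:.4f}")
```

The arrays `fpr`/`tpr` and `recall`/`precision` can be passed directly to a plotting library such as matplotlib to draw the two curves.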
The study highlights the importance of feature selection and preprocessing in improving the performance of machine learning models for breast cancer diagnosis. The AdaBoost-Logistic algorithm offers a high-precision solution for breast cancer classification, with potential applications in clinical settings for early detection and diagnosis. The results suggest that integrating machine learning with medical data can significantly enhance the accuracy and efficiency of breast cancer diagnosis.