Prediction error estimation: a comparison of resampling methods

2005 | Annette M. Molinaro, Richard Simon, Ruth M. Pfeiffer
This paper compares resampling methods for estimating prediction error in high-dimensional genomic data, focusing on the impact of feature selection. The study evaluates methods such as resubstitution, split-sample, leave-one-out cross-validation (LOOCV), 10-fold cross-validation (CV), and the .632+ bootstrap.
The results show that for small sample sizes, LOOCV and 10-fold CV have the smallest bias for diagonal discriminant analysis, nearest neighbor, and classification trees, as well as for linear discriminant analysis, and they generally perform well in terms of both bias and mean squared error. The .632+ bootstrap has the lowest mean squared error but is biased in small samples with strong signal-to-noise ratios, which makes it better suited to moderate or weak signal-to-noise ratios. As sample size increases, the performance differences among resampling methods decrease. The study uses simulated, microarray, and proteomic data to assess how well each method estimates prediction error. It also highlights the role of feature selection in the bias and accuracy of the resulting error estimates. Overall, the findings suggest that LOOCV and 10-fold CV are preferred for small samples, and that a resampling method should be chosen based on the specific characteristics of the data and the goals of the analysis.
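To make the comparison concrete, the following is a minimal sketch, not the authors' code, of how resubstitution, 10-fold CV, and LOOCV estimates can be computed when feature selection is part of model building. Everything here is an assumption for illustration: the function names, the pure-noise simulated data, the mean-difference feature ranking, and the nearest-centroid classifier (a simple stand-in, not one of the classifiers evaluated in the paper). The point it shows is that feature selection is repeated inside every training fold, so held-out samples never influence which features are chosen.

```python
import random


def make_noise_data(n=40, p=200, seed=0):
    """Simulate n samples with p pure-noise features and balanced 0/1 labels."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]
    return X, y


def class_means(X, y, label, features):
    """Mean of each requested feature over the samples of one class.

    Assumes the class is present in X; an empty class would divide by zero.
    """
    rows = [row for row, lab in zip(X, y) if lab == label]
    return [sum(r[j] for r in rows) / len(rows) for j in features]


def select_features(X, y, k):
    """Return the indices of the k features with the largest class-mean gap."""
    all_feats = range(len(X[0]))
    m0 = class_means(X, y, 0, all_feats)
    m1 = class_means(X, y, 1, all_feats)
    ranked = sorted(all_feats, key=lambda j: abs(m1[j] - m0[j]), reverse=True)
    return ranked[:k]


def fit_and_count_errors(X_tr, y_tr, X_te, y_te, k):
    """Select features and fit centroids on training data only, then count test errors."""
    feats = select_features(X_tr, y_tr, k)
    cents = {lab: class_means(X_tr, y_tr, lab, feats) for lab in (0, 1)}
    errors = 0
    for row, lab in zip(X_te, y_te):
        # Squared Euclidean distance to each class centroid on the selected features.
        dist = {c: sum((row[j] - m) ** 2 for j, m in zip(feats, cents[c]))
                for c in cents}
        errors += min(dist, key=dist.get) != lab
    return errors


def resubstitution_error(X, y, k=5):
    """Train and test on the same samples (the optimistic baseline)."""
    return fit_and_count_errors(X, y, X, y, k) / len(X)


def cv_error(X, y, n_folds=10, k=5, seed=0):
    """k-fold CV error; n_folds=len(X) gives leave-one-out CV."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    errors = 0
    for f in range(n_folds):
        test = idx[f::n_folds]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        errors += fit_and_count_errors(
            [X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test], k)
    return errors / len(X)


if __name__ == "__main__":
    X, y = make_noise_data()
    print("resubstitution:", resubstitution_error(X, y))
    print("10-fold CV:    ", cv_error(X, y, n_folds=10))
    print("LOOCV:         ", cv_error(X, y, n_folds=len(X)))
```

Because the simulated labels are unrelated to the features, the true error rate of any classifier here is 50%. One would expect the resubstitution estimate to fall well below that, while the cross-validated estimates should sit much closer to it, though individual runs on a sample this small can vary considerably.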