Selection bias in gene extraction on the basis of microarray gene-expression data

May 14, 2002 | Christophe Ambroise and Geoffrey J. McLachlan
The paper addresses the issue of selection bias in gene extraction based on microarray gene-expression data, which can lead to over-optimistic estimates of prediction error when methods such as leave-one-out cross-validation (CV) or internal cross-validation are used. Selection bias occurs when the test set used to evaluate a prediction rule has also been used in the gene selection process, leading to an inaccurate assessment of the rule's performance on new, unseen data.

To correct for this bias, external cross-validation or the bootstrap is recommended. The paper emphasizes the use of 10-fold cross-validation over leave-one-out CV, and the .632+ bootstrap error estimate, which accounts for overfitting. The study demonstrates that once selection bias is corrected, the cross-validated error is no longer zero for a subset of only a few genes.

Two published data sets are used: the colon data and the leukemia data. For the colon data, the true prediction error rate is estimated to be above 15%; for the leukemia data, it is approximately 5%. The .632+ bootstrap error estimate is found to have a slightly smaller root mean squared error than 10-fold cross-validation on both data sets.

The paper also compares different feature selection methods, including backward elimination with the support vector machine (SVM) and forward selection with Fisher's linear discriminant rule. It shows that while feature selection can reduce the number of genes used, it does not necessarily improve the prediction error rate; the SVM rule with backward elimination performs slightly better than Fisher's rule with forward selection. The study highlights the importance of correcting for selection bias when estimating the prediction error of a rule formed from a subset of genes selected from a large set of available genes.
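The difference between internal and external cross-validation can be illustrated with a small simulation (a hypothetical sketch, not the paper's actual experiment): on pure-noise data with no real class signal, selecting genes on all samples before cross-validating (internal CV) yields a deceptively low error, while re-selecting genes inside each training fold (external CV) gives an honest error near 50%. The gene filter and nearest-centroid rule below are simplified stand-ins for the paper's selection and classification procedures.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 40, 1000, 10            # samples, genes, genes retained
X = rng.standard_normal((n, p))   # pure noise: no true class structure
y = np.array([0] * (n // 2) + [1] * (n // 2))

def select_genes(X, y, k):
    """Rank genes by absolute difference in class means (a simple filter)."""
    diff = np.abs(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))
    return np.argsort(diff)[-k:]

def cv_error(X, y, k, folds=10, external=True):
    """10-fold CV error of a nearest-centroid rule on k selected genes."""
    idx = np.arange(len(y))
    if not external:                   # internal CV: select once, on ALL samples
        genes = select_genes(X, y, k)
    wrong = 0
    for f in range(folds):
        test = idx[f::folds]
        train = np.setdiff1d(idx, test)
        if external:                   # external CV: re-select within each training fold
            genes = select_genes(X[train], y[train], k)
        c0 = X[train[y[train] == 0]][:, genes].mean(axis=0)
        c1 = X[train[y[train] == 1]][:, genes].mean(axis=0)
        for t in test:
            d0 = np.linalg.norm(X[t, genes] - c0)
            d1 = np.linalg.norm(X[t, genes] - c1)
            pred = 0 if d0 <= d1 else 1
            wrong += pred != y[t]
    return wrong / len(y)

internal = cv_error(X, y, k, external=False)  # optimistic: genes "saw" the test folds
ext = cv_error(X, y, k, external=True)        # honest: should hover near 0.5
```

Because the data contain no signal, any apparent accuracy under internal CV is entirely an artifact of the test samples having participated in gene selection.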
It also notes that if a test set is used to estimate the prediction error, it should not be involved in the gene selection process. The paper concludes that cross-validation and the bootstrap are effective methods for correcting selection bias, and that the .632+ bootstrap error estimate is particularly useful for handling overfitting in prediction rules.
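The .632+ estimator recommended in the paper weights the resubstitution error and the leave-one-out bootstrap error according to a relative overfitting rate, so that the estimate moves toward the bootstrap error as overfitting grows. A minimal sketch, following the definition of Efron and Tibshirani (1997); the argument names are illustrative:

```python
def err_632_plus(err_train, err_loo_boot, gamma):
    """.632+ bootstrap error estimate.

    err_train    : resubstitution (training) error of the rule
    err_loo_boot : leave-one-out bootstrap error, err^(1)
    gamma        : no-information error rate
    """
    # Relative overfitting rate R in [0, 1]; zero when there is no overfitting
    if err_loo_boot > err_train and gamma > err_train:
        R = (err_loo_boot - err_train) / (gamma - err_train)
    else:
        R = 0.0
    # Weight w rises from 0.632 (no overfitting) toward 1 (severe overfitting)
    w = 0.632 / (1.0 - 0.368 * R)
    return (1.0 - w) * err_train + w * err_loo_boot
```

With no overfitting this reduces to the ordinary .632 estimate; with maximal overfitting (training error 0, bootstrap error at the no-information rate) the weight reaches 1 and the estimate equals the bootstrap error itself.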