A survey of cross-validation procedures for model selection
Sylvain Arlot and Alain Celisse present a comprehensive review of cross-validation (CV) procedures for model selection, emphasizing their theoretical and empirical properties. CV is a widely used strategy for estimating the risk of an estimator or selecting the best model among several candidates. It works by splitting the data, once or several times: part of the data (the training sample) is used to train each algorithm, and the remaining part (the validation sample) is used to estimate its risk. The algorithm with the smallest estimated risk is then selected. CV avoids overfitting because the training sample is independent of the validation sample, and its popularity stems from its generality and simplicity. However, theoretical and empirical studies show that CV is not universally effective: a given procedure can fail for specific model selection tasks, depending on whether the goal is estimation or identification.
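The selection rule described above (train on one part, estimate the risk on the rest, keep the candidate with the smallest estimated risk) can be sketched in a few lines. This is a minimal illustration and not code from the survey: the k-nearest-neighbour candidates, the synthetic sine data, the 70/30 split, and the helper names knn_fit, estimated_risk and cv_select are all assumptions made for the example.

```python
import math
import random


def knn_fit(train, k):
    """Train a 1-D k-nearest-neighbour regressor (hypothetical candidate family)."""
    def predict(x):
        nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict


def estimated_risk(fit, data, splits):
    """Average validation squared error of `fit` over (train_idx, val_idx) splits."""
    split_risks = []
    for train_idx, val_idx in splits:
        # Train only on the training sample ...
        predictor = fit([data[i] for i in train_idx])
        # ... and measure the error only on the held-out validation sample.
        errors = [(predictor(data[i][0]) - data[i][1]) ** 2 for i in val_idx]
        split_risks.append(sum(errors) / len(errors))
    return sum(split_risks) / len(split_risks)


def cv_select(candidates, data, splits):
    """Return the name of the candidate with the smallest estimated risk."""
    risks = {name: estimated_risk(fit, data, splits)
             for name, fit in candidates.items()}
    best = min(risks, key=risks.get)
    return best, risks


# Synthetic data (an assumption for illustration): noisy samples of sin(x).
rng = random.Random(0)
xs = [rng.uniform(0, 6) for _ in range(100)]
data = [(x, math.sin(x) + rng.gauss(0, 0.3)) for x in xs]

# A single random split: 70 training points, 30 validation points.
indices = list(range(len(data)))
rng.shuffle(indices)
splits = [(indices[:70], indices[70:])]

candidates = {f"k={k}": (lambda train, k=k: knn_fit(train, k)) for k in (1, 5, 40)}
best, risks = cv_select(candidates, data, splits)
print(best, risks)
```

Passing several splits instead of one averages the validation error over them, which corresponds to splitting the data several times.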
The paper discusses various model selection paradigms, including estimation and identification, and explores different model selection procedures, such as unbiased risk estimation, biased risk estimation, and structural risk minimization. It also examines the statistical properties of CV estimators, including bias and variance, and their implications for model selection. The authors highlight the importance of choosing the appropriate CV procedure based on the specific characteristics of the problem, such as the type of model, the size of the dataset, and the goal of model selection (estimation or identification).
The paper also addresses the limitations of CV, including its dependence on assumptions about the data distribution and its computational complexity. It discusses various CV procedures, including hold-out, leave-one-out, and V-fold cross-validation, and their respective advantages and disadvantages. The authors conclude that while CV is a powerful tool for model selection, its effectiveness depends on the specific context and the goals of the model selection task. The survey aims to provide a clear understanding of the theoretical and empirical aspects of CV, helping researchers choose the most appropriate procedure for their specific needs.
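The three splitting schemes named above differ mainly in how many (training, validation) pairs they produce, which is also what drives their computational cost. The sketch below generates the index splits for each scheme; the function names hold_out, leave_one_out and v_fold, the seed argument, and the round-robin fold assignment are assumptions made for this illustration, not notation from the survey.

```python
import random


def hold_out(n, n_train, seed=0):
    """One random split: each candidate is trained once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [(idx[:n_train], idx[n_train:])]


def leave_one_out(n):
    """n splits, each validating on a single point: n trainings per candidate."""
    return [([j for j in range(n) if j != i], [i]) for i in range(n)]


def v_fold(n, v, seed=0):
    """V splits: the data is partitioned into V blocks, each used once for validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::v] for i in range(v)]  # round-robin assignment to blocks
    return [
        ([j for f, fold in enumerate(folds) if f != i for j in fold], folds[i])
        for i in range(v)
    ]


for name, splits in [
    ("hold-out", hold_out(10, 7)),
    ("leave-one-out", leave_one_out(10)),
    ("5-fold", v_fold(10, 5)),
]:
    train_idx, val_idx = splits[0]
    print(f"{name}: {len(splits)} split(s), "
          f"{len(train_idx)} training / {len(val_idx)} validation points each")
```

With n = 10 points, hold-out requires one training run per candidate, 5-fold requires five, and leave-one-out requires ten, which illustrates why the computational cost grows with the number of splits.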