November 7, 2019 | Andrius Vabalas, Emma Gowen, Ellen Poliakoff, Alexander J. Casson
Machine learning (ML) algorithm validation with a limited sample size is a critical issue in research involving high-dimensional data, such as neuroimaging, genomics, and motion tracking. Small sample sizes are common due to the high cost of data collection involving human participants and the difficulty in recruiting large numbers of participants. High-dimensional data with small sample sizes can lead to biased performance estimates in ML models. This study investigates the impact of validation methods on performance estimates and identifies strategies to avoid overfitting.
The authors reviewed published studies that applied ML to distinguish autistic from non-autistic individuals and found that smaller sample sizes were associated with higher reported classification accuracy. Their simulations showed that K-fold cross-validation (CV) produces strongly biased performance estimates with small sample sizes, and that the bias is still evident at a sample size of 1000. Nested CV and train/test split approaches produced robust, unbiased performance estimates regardless of sample size. Feature selection performed on pooled training and testing data contributed more to the bias than hyper-parameter tuning. The study also examined how data dimensionality, the size of the hyper-parameter space, and the number of CV folds contribute to the bias.
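To make the distinction between the validation schemes concrete, the sketch below (not the authors' code; it assumes scikit-learn, a linear SVM, and an illustrative C grid) contrasts plain K-fold CV, where the reported score comes from the same folds used to tune hyper-parameters, with nested CV, where tuning is confined to an inner loop and the outer folds are used only for evaluation.

```python
# Minimal sketch: plain K-fold CV with tuning on the same folds vs. nested CV.
# Assumes scikit-learn; the classifier and C grid are illustrative choices.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))       # 50 samples, 200 features: pure noise
y = rng.integers(0, 2, size=50)          # random binary labels

param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
outer = KFold(n_splits=5, shuffle=True, random_state=0)
inner = KFold(n_splits=5, shuffle=True, random_state=1)

# Optimistic: best_score_ is the best score found while tuning on all the data,
# so evaluation and hyper-parameter selection share the same folds.
tuned = GridSearchCV(SVC(kernel="linear"), param_grid, cv=outer).fit(X, y)
print("K-fold CV (tuning and evaluation share folds):", tuned.best_score_)

# Nested CV: the outer folds never influence hyper-parameter selection,
# so the mean outer score is an unbiased estimate (~0.5 on noise).
# With a small grid like this the gap is typically modest, consistent with
# the finding that tuning leakage contributes less bias than feature selection.
nested = cross_val_score(
    GridSearchCV(SVC(kernel="linear"), param_grid, cv=inner), X, y, cv=outer
)
print("Nested CV:", nested.mean())
```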
The study used synthetic Gaussian noise data to investigate how sample size, the feature-to-sample ratio, and parameter tuning drive overfitting. The higher the feature-to-sample ratio, the more likely an ML model was to fit noise in the data and report spuriously high, above-chance accuracy. The number of tunable hyper-parameters also affected the likelihood of overfitting, though to a lesser degree.
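As a rough illustration of that kind of noise simulation (assuming scikit-learn; the linear SVM, univariate feature selection, and feature counts are illustrative choices rather than the paper's exact configuration), the sketch below shows apparent K-fold CV accuracy on pure Gaussian noise climbing above chance as the feature-to-sample ratio grows, when feature selection is performed on the pooled data before cross-validation.

```python
# Minimal sketch: feature selection on the full (pooled) dataset before K-fold CV
# lets a classifier score well above chance on pure noise, and the effect grows
# with the feature-to-sample ratio. Assumes scikit-learn; choices are illustrative.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_samples = 40
y = rng.integers(0, 2, size=n_samples)   # random binary labels

for n_features in (20, 100, 500, 2000):
    X = rng.standard_normal((n_samples, n_features))            # pure noise
    X_sel = SelectKBest(f_classif, k=10).fit_transform(X, y)    # leak: uses all samples
    acc = cross_val_score(SVC(kernel="linear"), X_sel, y, cv=5).mean()
    print(f"features/samples = {n_features / n_samples:5.1f} -> CV accuracy {acc:.2f}")
```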
Comparing the validation methods directly, only nested CV and train/test split produced unbiased performance estimates. Even on discriminable data containing a genuine signal, K-fold CV produced significantly higher performance estimates than nested CV or train/test split.
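A minimal sketch of the recommended train/test-split workflow follows (again assuming scikit-learn; the pipeline components and grid values are illustrative): feature selection and hyper-parameter tuning are fit only on the training portion, so the held-out test set plays no role in any modelling decision, and on noise data the resulting estimate stays near chance.

```python
# Minimal sketch: train/test split with feature selection and tuning confined
# to the training data via a Pipeline. Assumes scikit-learn; names are illustrative.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 500))          # high-dimensional noise data
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),   # fit on training folds only
    ("clf", SVC(kernel="linear")),
])
search = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)                    # tuning sees only the training split

# The test set is touched exactly once, for the final estimate (~0.5 on noise).
print("Held-out test accuracy:", search.score(X_test, y_test))
```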
The study concludes that robust validation methods are essential for ML research, especially when working with small datasets. It also highlights the importance of separating training and testing data to avoid overfitting and provides guidance on interpreting results from other studies based on the validation method used.