Georg Heinze, Christine Wallisch, Daniela Dunkler (2018)
Variable selection is a critical aspect of statistical modeling, particularly in fields such as the life sciences, where analyses often involve a moderate number of candidate variables (10–30). This review discusses variable selection methods based on significance criteria, information criteria, penalized likelihood, the change-in-estimate criterion, and background knowledge. These methods were typically developed for linear regression models and later adapted to generalized linear models and survival analysis. However, variable selection can compromise model stability, the unbiasedness of regression coefficients, and the validity of p-values and confidence intervals. The review offers practical recommendations for statisticians applying variable selection in low-dimensional modeling and for investigating model stability, and it suggests reporting quantities obtained by resampling the entire variable selection process to ensure transparency and reliability. A key consideration is the events-per-variable (EPV) ratio, which balances the amount of information in the data against the number of parameters to be estimated. The review emphasizes background knowledge, stability investigations, and shrinkage techniques as ways to improve model performance and interpretability. Specific recommendations include preferring backward elimination (BE) over forward selection (FS), attending to the EPV ratio, and using resampling methods to assess model stability. The review concludes that variable selection should be applied judiciously, with a focus on model interpretability and accuracy, and that robust methods are needed to address the challenges of high-dimensional data and model uncertainty.
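To make the recommended procedure concrete, here is a minimal sketch of AIC-based backward elimination for a linear model, written with plain NumPy. The function names and the simulated demo data are illustrative assumptions, not taken from the review; the review itself discusses BE with several possible criteria (significance levels, AIC, BIC), of which AIC is used here.

```python
import numpy as np

def linear_aic(X, y):
    """AIC of an OLS fit with intercept: n*log(RSS/n) + 2*(k+1)."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])  # add intercept column
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    rss = np.sum((y - Xd @ beta) ** 2)
    return n * np.log(rss / n) + 2 * Xd.shape[1]

def backward_eliminate(X, y, names):
    """Repeatedly drop the variable whose removal lowers AIC the most;
    stop when no single removal improves the criterion."""
    keep = list(range(X.shape[1]))
    current = linear_aic(X[:, keep], y)
    while len(keep) > 1:
        candidates = []
        for j in keep:
            rest = [k for k in keep if k != j]
            candidates.append((linear_aic(X[:, rest], y), j))
        best_aic, drop = min(candidates)
        if best_aic < current:
            keep.remove(drop)
            current = best_aic
        else:
            break
    return [names[j] for j in keep]

# Hypothetical demo: y depends on x1 and x2 only; x3 and x4 are noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=200)
selected = backward_eliminate(X, y, ["x1", "x2", "x3", "x4"])
```

With strong true effects, the informative variables survive elimination; noise variables are usually, though not always, dropped, which is exactly the instability the review's resampling diagnostics are meant to expose.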
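The suggestion to resample the entire selection process can be sketched as bootstrap inclusion frequencies: repeat the selection on bootstrap resamples and report how often each variable is retained. For brevity this sketch uses a crude significance-style selector (keep variables with |t| > 2 in the full model); the selector, threshold, and data are illustrative assumptions, not the review's specific procedure.

```python
import numpy as np

def select_significant(X, y, t_cut=2.0):
    """Crude significance-based selection: fit the full OLS model
    and keep variables whose |t|-statistic exceeds t_cut."""
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    sigma2 = resid @ resid / (n - p - 1)
    cov = sigma2 * np.linalg.inv(Xd.T @ Xd)
    t = beta[1:] / np.sqrt(np.diag(cov)[1:])  # skip intercept
    return np.abs(t) > t_cut

def inclusion_frequencies(X, y, n_boot=200, seed=0):
    """Re-run the whole selection on bootstrap resamples and report
    how often each variable is selected (a stability diagnostic)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    counts = np.zeros(X.shape[1])
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)  # resample rows with replacement
        counts += select_significant(X[idx], y[idx])
    return counts / n_boot

# Hypothetical demo: only x1 has a true effect; x2, x3 are noise.
rng = np.random.default_rng(7)
X = rng.normal(size=(150, 3))
y = 3 * X[:, 0] + rng.normal(size=150)
freqs = inclusion_frequencies(X, y)
```

A strongly predictive variable ends up with an inclusion frequency near 1, while noise variables appear only sporadically; reporting these frequencies alongside the selected model is the kind of transparency the review advocates.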