2024-04-30 | Elvis Dohmatob, Yunzhen Feng, Julia Kempe
This paper investigates the phenomenon of model collapse in high-dimensional regression, where training a model on data generated recursively from its own outputs leads to a degradation in performance. The study focuses on ridge regression with Gaussian design and derives analytic formulas that quantify the impact of synthetic data generation on test error. The key findings include:
1. **Exact Characterization of Test Error**: The test error under iterative retraining on synthetic data is decomposed into a clean error term and an additional term that depends on the number of generations and the properties of the data and model. This term highlights the negative effects of synthetic data generation on model performance.
2. **Modified Scaling Laws**: In the case of power-law spectra, the study derives new scaling laws that show how the test error changes with the number of generations and the parameters of the data and model. These laws reveal a crossover from a fast to a slow rate of error decay as the number of generations increases.
3. **Adaptive Regularization**: The paper proposes a strategy based on adaptive regularization to mitigate model collapse. This approach adjusts the regularization parameter according to the number of generations and the properties of the data, leading to improved performance (see the sketch after this list for a toy illustration).
4. **Empirical Validation**: The theoretical results are validated through experiments, showing that the proposed regularization strategy effectively reduces the test error in the presence of synthetic data.
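To make the setting concrete, below is a minimal sketch (not the authors' code) of iterative retraining with ridge regression under a Gaussian design: each generation fits a ridge estimator to labels produced by the previous generation's model and passes its own predictions on as the next generation's training labels. The dimensions, noise level, and both regularization schedules (including the `lambda g: 0.1 * g` "adaptive" rule) are illustrative assumptions, not the formulas derived in the paper.

```python
# Toy simulation of model collapse under iterative retraining (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 50, 200, 0.5                 # dimension, samples per generation, label noise (assumed)
w_star = rng.normal(size=d) / np.sqrt(d)   # ground-truth regressor

def ridge(X, y, lam):
    """Ridge estimator: solve (X^T X + n*lam*I) w = X^T y."""
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

def test_error(w_hat):
    """Excess risk ||w_hat - w_star||^2 under isotropic Gaussian test inputs."""
    return float(np.sum((w_hat - w_star) ** 2))

def collapse_curve(n_generations, lam_schedule):
    """Retrain each generation on labels generated by the previous generation's model."""
    w_label = w_star                        # generation 0 is trained on clean data
    errors = []
    for g in range(1, n_generations + 1):
        X = rng.normal(size=(n, d))
        y = X @ w_label + sigma * rng.normal(size=n)   # synthetic labels + fresh noise
        w_hat = ridge(X, y, lam_schedule(g))
        errors.append(test_error(w_hat))
        w_label = w_hat                     # next generation trains on this model's outputs
    return errors

fixed = collapse_curve(10, lam_schedule=lambda g: 0.1)
adaptive = collapse_curve(10, lam_schedule=lambda g: 0.1 * g)   # hypothetical schedule growing with g
print("fixed lambda   :", [f"{e:.3f}" for e in fixed])
print("adaptive lambda:", [f"{e:.3f}" for e in adaptive])
```

In this toy setup the error of the fixed-regularization chain tends to drift upward across generations, while whether the growing schedule helps depends on the chosen constants; the paper's adaptive rule is derived analytically from the data spectrum, the noise level, and the number of generations rather than chosen by hand.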
The study provides a theoretical understanding of model collapse in high-dimensional regression, highlighting the importance of regularization and the impact of synthetic data generation on model performance. The results have implications for the design and training of large language and image generation models, where model collapse can lead to significant degradation in performance.