2024-04-30 | Elvis Dohmatob, Yunzhen Feng, Julia Kempe
This paper investigates the phenomenon of model collapse in high-dimensional regression, where training a model on data generated recursively from its own outputs leads to a degradation in performance. The study focuses on ridge regression with Gaussian design and derives analytic formulas that quantify the impact of synthetic data generation on test error. The key findings include:
1. **Exact Characterization of Test Error**: The test error under iterative retraining on synthetic data is decomposed into a clean error term and an additional term that depends on the number of generations and the properties of the data and model. This term highlights the negative effects of synthetic data generation on model performance.
2. **Modified Scaling Laws**: In the case of power-law spectra, the study derives new scaling laws that show how the test error changes with the number of generations and the parameters of the data and model. These laws reveal a crossover from a fast to a slow rate of error decay as the number of generations increases.
3. **Adaptive Regularization**: The paper proposes a strategy based on adaptive regularization to mitigate model collapse. This approach adjusts the regularization parameter according to the number of generations and the properties of the data, leading to improved performance (see the sketch after this list for a toy illustration).
4. **Empirical Validation**: The theoretical results are validated through experiments, showing that the proposed regularization strategy effectively reduces the test error in the presence of synthetic data.
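To make the setting concrete, below is a minimal sketch (not the authors' code) of iterative retraining with ridge regression under a Gaussian design: each generation fits a ridge estimator to labels produced by the previous generation's model and passes its own predictions on as the next generation's training labels. The dimensions, noise level, and both regularization schedules (including the `lambda g: 0.1 * g` "adaptive" rule) are illustrative assumptions, not the formulas derived in the paper.

```python
# Toy simulation of model collapse under iterative retraining (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 50, 200, 0.5                 # dimension, samples per generation, label noise (assumed)
w_star = rng.normal(size=d) / np.sqrt(d)   # ground-truth regressor

def ridge(X, y, lam):
    """Ridge estimator: solve (X^T X + n*lam*I) w = X^T y."""
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

def test_error(w_hat):
    """Excess risk ||w_hat - w_star||^2 under isotropic Gaussian test inputs."""
    return float(np.sum((w_hat - w_star) ** 2))

def collapse_curve(n_generations, lam_schedule):
    """Retrain each generation on labels generated by the previous generation's model."""
    w_label = w_star                        # generation 0 is trained on clean data
    errors = []
    for g in range(1, n_generations + 1):
        X = rng.normal(size=(n, d))
        y = X @ w_label + sigma * rng.normal(size=n)   # synthetic labels + fresh noise
        w_hat = ridge(X, y, lam_schedule(g))
        errors.append(test_error(w_hat))
        w_label = w_hat                     # next generation trains on this model's outputs
    return errors

fixed = collapse_curve(10, lam_schedule=lambda g: 0.1)
adaptive = collapse_curve(10, lam_schedule=lambda g: 0.1 * g)   # hypothetical schedule growing with g
print("fixed lambda   :", [f"{e:.3f}" for e in fixed])
print("adaptive lambda:", [f"{e:.3f}" for e in adaptive])
```

In this toy setup the error of the fixed-regularization chain tends to drift upward across generations, while whether the growing schedule helps depends on the chosen constants; the paper's adaptive rule is derived analytically from the data spectrum, the noise level, and the number of generations rather than chosen by hand.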
The study provides a theoretical understanding of model collapse in high-dimensional regression, highlighting the importance of regularization and the impact of synthetic data generation on model performance. The results have implications for the design and training of large language and image generation models, where model collapse can lead to significant degradation in performance.