25 July 2024 | Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson & Yarin Gal
AI models trained on recursively generated data suffer a phenomenon called 'model collapse', in which successive models lose the ability to represent the true underlying data distribution. Because each model is trained on data produced by its predecessors, its view of the original distribution degrades, and it fails to capture the distribution's full range. The process is observed across generative model families, including large language models (LLMs), variational autoencoders (VAEs) and Gaussian mixture models (GMMs). Collapse is driven by three sources of error: statistical approximation error, functional expressivity error and functional approximation error. These errors compound over generations, pushing the learned distribution increasingly far from the original one, typically with reduced variance.
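The variance-shrinking effect of statistical approximation error alone can be seen in a toy simulation (a minimal sketch for intuition, not the paper's experimental setup): repeatedly fit a Gaussian by maximum likelihood to a finite sample drawn from the previous generation's fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0 is fit to samples from the "true" distribution N(0, 1).
mu, sigma = 0.0, 1.0
n = 50  # samples per generation; finite n is the source of error here

variances = []
for generation in range(2000):
    # Draw a finite sample from the current model...
    samples = rng.normal(mu, sigma, size=n)
    # ...and fit the next model to that sample alone (maximum likelihood).
    mu, sigma = samples.mean(), samples.std()
    variances.append(sigma ** 2)

print(f"variance: generation 1 = {variances[0]:.3f}, "
      f"generation 2000 = {variances[-1]:.3g}")
```

Each refit multiplies the variance by a random factor whose expected logarithm is slightly negative, so the variance performs a downward-drifting random walk and collapses toward zero, exactly the Gaussian behaviour the analysis predicts.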
Theoretical analysis shows that model collapse is a universal phenomenon among generative models that train recursively on data produced by previous generations. For discrete distributions, collapse first erases information about low-probability events and eventually concentrates all probability mass on a single outcome, a delta function. For multidimensional Gaussian distributions, the variance collapses to zero as the number of generations increases.
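The discrete case can be sketched the same way (a toy illustration under the paper's assumptions, not its code): treat each generation's "model" as the empirical distribution of a finite sample from the previous one. Rare events vanish quickly, and given enough generations all mass ends up on one outcome.

```python
import numpy as np

rng = np.random.default_rng(0)

# "True" discrete distribution with one low-probability event (index 3).
probs = np.array([0.5, 0.3, 0.19, 0.01])
n = 100  # samples drawn per generation

for generation in range(10_000):
    counts = rng.multinomial(n, probs)
    probs = counts / n  # next model = empirical distribution of the sample
    if probs.max() == 1.0:
        break  # all mass on a single outcome: a delta function

print(f"collapsed to a delta after {generation + 1} generations")
print(probs)
```

This resampling chain has absorbing states at the delta functions and no way back: once an outcome's count hits zero it can never be sampled again, so fixation on a single point is certain in the long run.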
The study also demonstrates that model collapse can occur in language models trained on data generated by other models. Experiments show that models trained on generated data can still learn some aspects of the original task, but with errors. The results indicate that LLM-generated content on the Internet can pollute the data used to train future models, erasing information about the original data distribution.
The implications of model collapse are significant, as it affects the ability of models to accurately represent the true underlying data distribution. This is particularly important for tasks where the tails of the distribution are important, such as understanding complex systems or ensuring fair predictions. The study highlights the need to preserve access to original data sources and to distinguish data generated by LLMs from other data to ensure the continued effectiveness of training models on large-scale data.