25 July 2024 | Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson & Yarin Gal
AI models trained on recursively generated data suffer a phenomenon called 'model collapse', in which successive models lose the ability to represent the true underlying data distribution. Because each model is trained on data produced by its predecessors, its view of the original distribution degrades, and it fails to capture the distribution's full range. The process is observed across generative model families, including large language models (LLMs), variational autoencoders (VAEs) and Gaussian mixture models (GMMs). Collapse is driven by three sources of error: statistical approximation error, functional expressivity error and functional approximation error. These errors compound over generations, pushing the learned distribution increasingly far from the original one, typically with reduced variance.
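The variance-shrinking effect of statistical approximation error alone can be seen in a toy simulation (a minimal sketch for intuition, not the paper's experimental setup): repeatedly fit a Gaussian by maximum likelihood to a finite sample drawn from the previous generation's fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0 is fit to samples from the "true" distribution N(0, 1).
mu, sigma = 0.0, 1.0
n = 50  # samples per generation; finite n is the source of error here

variances = []
for generation in range(2000):
    # Draw a finite sample from the current model...
    samples = rng.normal(mu, sigma, size=n)
    # ...and fit the next model to that sample alone (maximum likelihood).
    mu, sigma = samples.mean(), samples.std()
    variances.append(sigma ** 2)

print(f"variance: generation 1 = {variances[0]:.3f}, "
      f"generation 2000 = {variances[-1]:.3g}")
```

Each refit multiplies the variance by a random factor whose expected logarithm is slightly negative, so the variance performs a downward-drifting random walk and collapses toward zero, exactly the Gaussian behaviour the analysis predicts.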
Theoretical analysis shows that model collapse is a universal phenomenon among generative models that train recursively on data produced by previous generations. For discrete distributions, collapse first erases information about low-probability events and eventually concentrates all probability mass on a single outcome, a delta function. For multidimensional Gaussian distributions, the variance collapses to zero as the number of generations increases.
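The discrete case can be sketched the same way (a toy illustration under the paper's assumptions, not its code): treat each generation's "model" as the empirical distribution of a finite sample from the previous one. Rare events vanish quickly, and given enough generations all mass ends up on one outcome.

```python
import numpy as np

rng = np.random.default_rng(0)

# "True" discrete distribution with one low-probability event (index 3).
probs = np.array([0.5, 0.3, 0.19, 0.01])
n = 100  # samples drawn per generation

for generation in range(10_000):
    counts = rng.multinomial(n, probs)
    probs = counts / n  # next model = empirical distribution of the sample
    if probs.max() == 1.0:
        break  # all mass on a single outcome: a delta function

print(f"collapsed to a delta after {generation + 1} generations")
print(probs)
```

This resampling chain has absorbing states at the delta functions and no way back: once an outcome's count hits zero it can never be sampled again, so fixation on a single point is certain in the long run.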
The study also demonstrates that model collapse can occur in language models trained on data generated by other models. Experiments show that models trained on generated data can still learn some aspects of the original task, but with errors. The results indicate that LLM-generated content on the Internet can pollute the data used to train future models, erasing information about the original data distribution.
The implications of model collapse are significant, as it affects the ability of models to accurately represent the true underlying data distribution. This is particularly important for tasks where the tails of the distribution are important, such as understanding complex systems or ensuring fair predictions. The study highlights the need to preserve access to original data sources and to distinguish data generated by LLMs from other data to ensure the continued effectiveness of training models on large-scale data.