A Tale of Tails: Model Collapse as a Change of Scaling Laws

31 May 2024 | Elvis Dohmatob *1, Yunzhen Feng *2, Pu Yang 3, Francois Charton 1, Julia Kempe 2,4
The paper explores the impact of synthetic data on neural scaling laws, particularly in the context of large language models (LLMs) and generative AI. As synthetic data becomes increasingly prevalent in training corpora, the authors investigate how it affects model performance and capabilities. They develop a theoretical framework for model collapse, the phenomenon in which models trained on generated data lose the ability to improve with more data or to acquire new skills. The study identifies several decay phenomena: loss of scaling, scaling laws that shift with the number of generations, and the "un-learning" of skills when human and synthetic data are mixed. The theoretical findings are validated through large-scale experiments with a transformer trained on an arithmetic task and with a large language model (Llama2) for text generation. The paper also discusses mitigation strategies, such as blending a small amount of clean data into the synthetic data, which curbs model collapse and gives rise to a "grokking" phenomenon in which performance plateaus before improving again. The results highlight the need for responsible data management and the risks of relying solely on synthetic data when training AI models.
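To make the tail-truncation mechanism concrete, here is a minimal, self-contained Python sketch of the Zipf/memorization toy setup the paper's theory builds on (in the spirit of Hutter's learning-curve model). All names and constants here (ALPHA, N_SKILLS, test_error, synthetic_distribution) are illustrative assumptions, not the authors' code. The idea: a generator trained on finitely many clean samples can only reproduce skills it has seen, so data sampled from it has a truncated tail; a learner trained on that synthetic data inherits the truncation and its test error plateaus instead of following the clean-data scaling law.

import numpy as np

rng = np.random.default_rng(0)

ALPHA = 1.5          # Zipf exponent of the skill distribution (assumed)
N_SKILLS = 100_000   # finite support approximating an infinite tail

# Ground-truth Zipf distribution over skills: p_i proportional to i^(-ALPHA)
p = np.arange(1, N_SKILLS + 1, dtype=float) ** -ALPHA
p /= p.sum()

def test_error(train_samples, true_p=p):
    """Memorizing learner: it answers correctly on any skill it has
    seen during training; test error is the unseen probability mass."""
    seen = np.zeros(N_SKILLS, dtype=bool)
    seen[np.unique(train_samples)] = True
    return true_p[~seen].sum()

def synthetic_distribution(n_gen_train):
    """Generation-1 'model': trained on n_gen_train clean samples, it
    can only emit skills it saw, re-normalized -- the truncated tail."""
    seen = np.unique(rng.choice(N_SKILLS, size=n_gen_train, p=p))
    q = np.zeros(N_SKILLS)
    q[seen] = p[seen]
    return q / q.sum()

q = synthetic_distribution(n_gen_train=10_000)

for n in [100, 1_000, 10_000, 100_000, 1_000_000]:
    clean = rng.choice(N_SKILLS, size=n, p=p)
    synth = rng.choice(N_SKILLS, size=n, p=q)
    print(f"n={n:>9,}  clean-data error={test_error(clean):.4f}  "
          f"synthetic-data error={test_error(synth):.4f}")

Running this, the clean-data error keeps shrinking as n grows (the usual power-law scaling), while the synthetic-data error stalls near the probability mass of skills the generator never saw: the "loss of scaling" the paper describes. Iterating the sampling step across generations, or mixing in a fraction of clean samples, would illustrate the shifted-scaling and mitigation effects in the same framework.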