2024 | Elvis Dohmatob, Yunzhen Feng, Pu Yang, Francois Charton, Julia Kempe
This paper investigates the impact of synthetic data on neural scaling laws and the phenomenon of model collapse in large language models (LLMs). Neural scaling laws have been used to predict how model performance improves as training data and model capacity grow. However, the increasing prevalence of synthetic data generated by AI models is changing the training landscape, potentially leading to model collapse. The authors propose a theoretical framework to analyze how synthetic data affects scaling laws and model performance.
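To make the scaling-law framing concrete, the sketch below fits a saturating power law of the form E(T) ≈ A·T^(−b) + E_inf to (dataset size, test loss) pairs; a clearly non-zero fitted floor E_inf is the kind of "loss of scaling" signature the paper associates with synthetic data in the training mix. This is an illustrative fit under assumed numbers, not the paper's code or measurements.

```python
# Illustrative sketch (not the paper's code): fit a saturating power law
# E(T) ~ A * T**(-b) + E_inf to (dataset size, test loss) pairs.
# A clearly non-zero fitted floor E_inf is the "loss of scaling" signature
# discussed above. All data values are hypothetical.
import numpy as np
from scipy.optimize import curve_fit

def scaling_curve(T, A, b, E_inf):
    """Test loss as a power law in dataset size T with an asymptotic floor."""
    return A * T ** (-b) + E_inf

T = np.array([1e4, 3e4, 1e5, 3e5, 1e6, 3e6])            # dataset sizes (e.g., tokens)
loss = np.array([2.10, 1.55, 1.18, 0.97, 0.88, 0.85])   # hypothetical test losses

(A, b, E_inf), _ = curve_fit(scaling_curve, T, loss, p0=(50.0, 0.4, 0.8), maxfev=10_000)
print(f"fitted exponent b = {b:.3f}, fitted loss floor E_inf = {E_inf:.3f}")
```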
The paper identifies several phenomena, including loss of scaling, shifted scaling with the number of generations, "un-learning" of skills, and grokking when mixing human and synthetic data. These effects are validated through experiments with a transformer trained on an arithmetic task and with text generation using the large language model Llama2. The authors derive new scaling laws that explain model collapse in simplified models and show that the presence of synthetic data can lead to a loss of scaling and degradation in model performance.
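A toy way to see the multi-generation effect is to repeatedly re-fit an empirical distribution to samples drawn from the previous generation's "model", as in the hedged sketch below. The Zipf-like ground truth, vocabulary size, and sample counts are assumptions for illustration, not the paper's experimental setup.

```python
# Toy simulation (in the spirit of the paper's analysis, not its experiments):
# each generation's "model" is the empirical distribution of samples drawn from
# the previous generation. Rare tail symbols drop out over generations and the
# distance to the original distribution grows -- a discrete caricature of
# model collapse across generations of AI-generated data.
import numpy as np

rng = np.random.default_rng(0)
V, N, GENERATIONS = 1000, 20_000, 5           # vocabulary size, samples per gen, generations

true_p = 1.0 / np.arange(1, V + 1) ** 1.5     # Zipf-like ground-truth distribution
true_p /= true_p.sum()

p = true_p.copy()
for g in range(1, GENERATIONS + 1):
    counts = np.bincount(rng.choice(V, size=N, p=p), minlength=V)
    p = counts / counts.sum()                 # next "model" = empirical frequencies
    tv = 0.5 * np.abs(p - true_p).sum()
    print(f"gen {g}: support {np.count_nonzero(p):4d}/{V}, TV distance to truth {tv:.3f}")
```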
The paper also explores the effects of tail-cutting and tail-narrowing in AI-generated data, which alter the data distribution and degrade model performance. The authors propose a triplet scaling law for memory-limited models, which accounts jointly for dataset size, embedding dimension (model capacity), and the frequency cutoff induced by tail-cutting. They also show that model collapse compounds over successive generations of AI data generation, with the error growing as the number of generations increases.
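Schematically, such a triplet law can be written as an additive power law in the three quantities. The exponents a, b, c below are placeholders rather than the paper's constants, which depend on the tail behavior of the data distribution; the decomposition into the three terms follows the summary's description.

```latex
% Schematic triplet scaling law with placeholder exponents a, b, c:
% a finite-data term in the sample size T, a capacity term in the embedding
% dimension d, and a floor set by the frequency cutoff k that tail-cutting
% imposes on AI-generated data.
E_{\text{test}}(T, d, k) \;\asymp\;
  \underbrace{T^{-a}}_{\text{finite data}}
  \;+\; \underbrace{d^{-b}}_{\text{limited capacity}}
  \;+\; \underbrace{k^{-c}}_{\text{tail cutoff}}
```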
The authors propose mitigation strategies, such as mixing AI-generated data with clean data, which can mitigate model collapse: training then exhibits a grokking-like phenomenon in which test error first plateaus and later decreases. They also discuss the benefits of carefully curating "tail" data to counteract the effects of AI-generated data.
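As a minimal illustration of the mixing strategy, the sketch below assembles a training set with a fixed clean-data fraction. The corpus names, the 10% fraction, and the helper function are hypothetical and not the paper's recipe.

```python
# Minimal sketch of the mitigation discussed above: build a training set from a
# mix of clean (human) data and AI-generated data with a fixed clean fraction.
# Sampling is with replacement; all names and sizes are illustrative.
import random

def mixed_training_set(clean_docs, synthetic_docs, clean_fraction=0.1,
                       total_size=10_000, seed=0):
    """Sample a training set containing roughly `clean_fraction` clean documents."""
    rng = random.Random(seed)
    n_clean = int(round(clean_fraction * total_size))
    n_synth = total_size - n_clean
    sample = (rng.choices(clean_docs, k=n_clean)
              + rng.choices(synthetic_docs, k=n_synth))
    rng.shuffle(sample)
    return sample

# Usage with placeholder corpora:
clean = [f"human_doc_{i}" for i in range(500)]
synthetic = [f"ai_doc_{i}" for i in range(5000)]
train = mixed_training_set(clean, synthetic, clean_fraction=0.1, total_size=1000)
```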
The paper concludes that the presence of synthetic data in training data can lead to a loss of scaling laws and model collapse, highlighting the need for responsible use of synthetic data and the importance of preserving clean data for AI training. The authors emphasize the need for further research into watermarking synthetic data to distinguish it from human-annotated data and to ensure the long-term sustainability of AI models.