Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance

25 Mar 2024 | Jiasheng Ye, Peiju Liu, Tianxiang Sun, Yunhua Zhou, Jun Zhan, Xipeng Qiu
This paper introduces data mixing laws: quantitative functions that describe how model performance changes with the mixture proportions of pretraining data. Large language models (LLMs) are pretrained on data from multiple domains, and the proportions of these domains significantly affect model performance. Existing methods tune these proportions with heuristics; the authors instead show that performance on unseen data mixtures can be predicted by fitting such functions to results from a handful of sample mixtures, enabling the selection of an optimal mixture before full-scale training.

To make these predictions affordable, the authors propose a nested approach that combines scaling laws for training steps and model sizes with the data mixing laws, so that the performance of large models trained on massive data can be predicted from small-scale experiments. Experimental results show that this method effectively optimizes the training mixture of a 1B-parameter model trained on 100B tokens, achieving performance comparable to a model trained for 48% more steps on the default mixture.

Applied to continual pretraining, the data mixing laws accurately predict the critical mixture proportion that avoids catastrophic forgetting, revealing the potential of dynamic data scheduling. The paper also discusses the implications of data mixing laws for pretraining data curation, showing how they can guide the design of data schedules and improve model performance. The findings demonstrate that data mixing laws provide a quantitative framework for optimizing data mixtures, leading to better pretraining performance and preserving the original abilities of pretrained models during continual pretraining.
The study highlights the importance of data mixing laws in the development of large language models and their potential for future research in data curation and model optimization.
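To make the idea concrete, the sketch below illustrates how a data mixing law of the exponential family described in the paper (loss as c + k * exp(t · r) over mixture proportions r) can be fitted to a few small-scale runs and then used to pick a promising mixture without training on it. The domain split, loss values, candidate grid, and initial guess are all illustrative assumptions, not the paper's actual data or code.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical small-scale runs: each row gives the mixture proportions
# (e.g. web / code / academic) used for one cheap training run, and
# `losses` the validation loss that run reached. All numbers are made up
# for illustration; in practice they come from the small-scale experiments.
mixtures = np.array([
    [0.80, 0.10, 0.10],
    [0.60, 0.30, 0.10],
    [0.50, 0.20, 0.30],
    [0.40, 0.40, 0.20],
    [0.30, 0.30, 0.40],
    [0.20, 0.50, 0.30],
])
losses = np.array([2.95, 2.88, 2.86, 2.84, 2.87, 2.91])

def mixing_law(r, c, k, *t):
    # Exponential mixing law over proportions: L(r) = c + k * exp(t . r)
    return c + k * np.exp(r @ np.asarray(t))

n_dom = mixtures.shape[1]
# Initial guess; differing t components avoid a degenerate starting point.
p0 = [2.0, 0.5] + list(np.linspace(-0.5, 0.5, n_dom))
params, _ = curve_fit(mixing_law, mixtures, losses, p0=p0, maxfev=20000)

# Use the fitted law to choose a mixture without training on it: predict the
# loss of many candidate mixtures (points on the simplex) and keep the best.
candidates = np.random.default_rng(0).dirichlet(np.ones(n_dom), size=5000)
predicted = mixing_law(candidates, *params)
best = candidates[np.argmin(predicted)]
print("predicted-best mixture:", np.round(best, 3),
      "predicted loss:", round(float(predicted.min()), 4))
```

In the paper's nested scheme, the losses fed into such a fit would themselves be extrapolated from step- and size-scaling laws fitted on small models, so no full-scale run is needed before choosing the mixture; with multiple validation domains, one law is fitted per domain and the selection objective aggregates the predicted per-domain losses.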