Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining


11 Jul 2024 | Ce Ge, Zhijian Ma, Daoyuan Chen, Yaliang Li, Bolin Ding
This paper introduces BiMix, a bivariate scaling law for language model (LM) data mixing that models the joint effects of data quantity and mixing proportions on model performance. The authors propose BiMix as a way to optimize data mixtures efficiently, reducing the need for resource-intensive proxy training runs, and demonstrate that entropy-driven data mixtures can match or exceed the performance of more expensive methods.

BiMix is derived from a scaling framework and validated through extensive experiments. It quantitatively characterizes how data quantity and mixing proportions affect validation loss, yielding interpretable and extensible results. The authors further incorporate entropy proxies to determine BiMix's coefficients and apply the law to scalable training scenarios. They show that BiMix can be fitted precisely across diverse datasets and used effectively for data mixture optimization, leading to faster convergence and better downstream task performance.

The study highlights the importance of data diversity and offers practical guidance for cost-effective language modeling. In particular, BiMix can optimize data mixtures without prior training, providing a training-free solution for efficient data mixing. The paper also discusses broader implications of these findings for AI development, emphasizing the need for more economical and environmentally friendly approaches.
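To make the idea concrete, the sketch below fits an illustrative bivariate power-law loss surface per data domain and then chooses mixing proportions in two ways: a training-free entropy proxy, and direct optimization of the fitted surface under a simplex constraint. The functional form L(p, n) = A * p^(-alpha) * n^(-beta) + E, the coefficient names, and the synthetic observations are assumptions for illustration only and are not taken from the paper; only the standard scipy routines (curve_fit, minimize) are real library calls.

```python
"""
Illustrative sketch (not the authors' code): fit an assumed bivariate
power-law loss surface per data domain, then pick mixing proportions
either from an entropy proxy or by optimizing the fitted surface.
"""
import numpy as np
from scipy.optimize import curve_fit, minimize


def loss_surface(X, A, alpha, beta, E):
    """Assumed form: L(p, n) = A * p^-alpha * n^-beta + E, where p is a
    domain's mixing proportion and n is a proxy for data quantity."""
    p, n = X
    return A * p ** (-alpha) * n ** (-beta) + E


# --- 1. Fit per-domain coefficients from (proportion, quantity, loss) runs ---
rng = np.random.default_rng(0)
p_obs = np.array([0.1, 0.2, 0.3, 0.1, 0.2, 0.3, 0.1, 0.2, 0.3])
n_obs = np.array([1e3, 1e3, 1e3, 5e3, 5e3, 5e3, 2e4, 2e4, 2e4])
true = (2.0, 0.3, 0.2, 1.5)                       # synthetic ground truth
y_obs = loss_surface((p_obs, n_obs), *true) + rng.normal(0, 0.01, p_obs.size)

coef, _ = curve_fit(loss_surface, (p_obs, n_obs), y_obs,
                    p0=(1.0, 0.5, 0.5, 1.0), maxfev=10000)
print("fitted coefficients:", np.round(coef, 3))

# --- 2. Training-free proxy: proportions proportional to domain entropy ---
domain_entropy = np.array([4.2, 3.1, 5.0])        # hypothetical bits/token
entropy_mix = domain_entropy / domain_entropy.sum()
print("entropy-driven mixture:", np.round(entropy_mix, 3))

# --- 3. Or optimize proportions on the fitted surfaces (simplex constraint) ---
domain_coefs = [coef, coef * 1.1, coef * 0.9]     # pretend three fitted domains
n_budget = 2e4


def total_loss(p):
    return sum(loss_surface((p[i], n_budget), *domain_coefs[i])
               for i in range(len(p)))


res = minimize(total_loss, x0=np.full(3, 1 / 3), method="SLSQP",
               bounds=[(1e-3, 1.0)] * 3,
               constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1}])
print("optimized mixture:", np.round(res.x, 3))
```

In a real setting the per-domain curves would be fitted to validation losses from a handful of small proxy runs, while the entropy-driven proportions correspond to the training-free alternative summarized above; the simplex-constrained optimization step is one plausible way to turn a fitted loss surface into a concrete mixture.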