Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance

25 Mar 2024 | Jiasheng Ye, Peiju Liu, Tianxiang Sun, Yunhua Zhou, Jun Zhan, Xipeng Qiu
This paper introduces data mixing laws: quantitative functions that describe how model performance changes with the mixture proportions of pretraining data. Large language models (LLMs) are pretrained on data from multiple domains, and the proportions of these domains significantly affect model performance. Existing methods tune these proportions with heuristics; the authors instead show that performance on unseen data mixtures can be predicted by fitting such functions to results from a handful of sample mixtures, enabling the selection of an optimal mixture before full-scale training.

To make these predictions affordable, the authors propose a nested approach that combines scaling laws for training steps and model sizes with the data mixing laws, so that the performance of large models trained on massive data can be predicted from small-scale experiments. Experimental results show that this method effectively optimizes the training mixture of a 1B-parameter model trained on 100B tokens, achieving performance comparable to a model trained for 48% more steps on the default mixture.

Applied to continual pretraining, the data mixing laws accurately predict the critical mixture proportion that avoids catastrophic forgetting, revealing the potential of dynamic data scheduling. The paper also discusses the implications of data mixing laws for pretraining data curation, showing how they can guide the design of data schedules and improve model performance. The findings demonstrate that data mixing laws provide a quantitative framework for optimizing data mixtures, leading to better pretraining performance and preserving the original abilities of pretrained models during continual pretraining.
The study highlights the importance of data mixing laws in the development of large language models and their potential for future research in data curation and model optimization.
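To make the idea concrete, the sketch below illustrates how a data mixing law of the exponential family described in the paper (loss as c + k * exp(t · r) over mixture proportions r) can be fitted to a few small-scale runs and then used to pick a promising mixture without training on it. The domain split, loss values, candidate grid, and initial guess are all illustrative assumptions, not the paper's actual data or code.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical small-scale runs: each row gives the mixture proportions
# (e.g. web / code / academic) used for one cheap training run, and
# `losses` the validation loss that run reached. All numbers are made up
# for illustration; in practice they come from the small-scale experiments.
mixtures = np.array([
    [0.80, 0.10, 0.10],
    [0.60, 0.30, 0.10],
    [0.50, 0.20, 0.30],
    [0.40, 0.40, 0.20],
    [0.30, 0.30, 0.40],
    [0.20, 0.50, 0.30],
])
losses = np.array([2.95, 2.88, 2.86, 2.84, 2.87, 2.91])

def mixing_law(r, c, k, *t):
    # Exponential mixing law over proportions: L(r) = c + k * exp(t . r)
    return c + k * np.exp(r @ np.asarray(t))

n_dom = mixtures.shape[1]
# Initial guess; differing t components avoid a degenerate starting point.
p0 = [2.0, 0.5] + list(np.linspace(-0.5, 0.5, n_dom))
params, _ = curve_fit(mixing_law, mixtures, losses, p0=p0, maxfev=20000)

# Use the fitted law to choose a mixture without training on it: predict the
# loss of many candidate mixtures (points on the simplex) and keep the best.
candidates = np.random.default_rng(0).dirichlet(np.ones(n_dom), size=5000)
predicted = mixing_law(candidates, *params)
best = candidates[np.argmin(predicted)]
print("predicted-best mixture:", np.round(best, 3),
      "predicted loss:", round(float(predicted.min()), 4))
```

In the paper's nested scheme, the losses fed into such a fit would themselves be extrapolated from step- and size-scaling laws fitted on small models, so no full-scale run is needed before choosing the mixture; with multiple validation domains, one law is fitted per domain and the selection objective aggregates the predicted per-domain losses.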