Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining


11 Jul 2024 | Ce Ge, Zhijian Ma, Daoyuan Chen, Yaliang Li, Bolin Ding
This paper addresses the challenge of integrating diverse data sources for language model pretraining, which is crucial for improving model performance and generalization. Traditional methods rely on heuristic schemes and lack theoretical guidance. To address this, the authors propose BiMix, a bivariate scaling law that models the joint impact of data quantity and mixing proportions on training outcomes. BiMix is designed to be efficient and scalable, providing a quantitative framework for optimizing data mixtures.

The key contributions of the paper are:

1. **Introduction of BiMix**: a unified scaling law that accurately models the bivariate scaling behavior of data quantity and mixing proportions.
2. **Empirical validation**: systematic experiments demonstrate the predictive power and underlying principles of BiMix, showing that entropy-driven data mixtures can match or exceed the performance of more resource-intensive methods.
3. **Practical insights**: the paper offers strategies for efficient data mixing, including selecting the most promising data mixtures without prior training and optimizing mixing proportions under customized constraints.

The authors evaluate BiMix on two domain-diverse datasets, *The Pile* and *SlimPajama*, and compare it with baseline and DoReMi methods. The results show that BiMix not only fits the observed training runs accurately but also predicts validation losses for unseen configurations.
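To make the idea of a bivariate scaling law concrete, the minimal sketch below fits a loss surface that depends on both a domain's mixing proportion and the training-token budget. The functional form `L(r, n) = C / (r**alpha * n**beta) + E`, the coefficient values, and the synthetic observations are all assumptions for illustration; the paper's actual BiMix formula and fitted parameters may differ.

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative bivariate power-law surface (an assumption, not the paper's
# exact BiMix formula):  L(r, n) = C / (r**alpha * n**beta) + E
#   r: a domain's mixing proportion, n: training tokens seen.
def loss_surface(X, C, alpha, beta, E):
    r, n = X
    return C / (r**alpha * n**beta) + E

rng = np.random.default_rng(0)

# Synthetic observations standing in for measured validation losses at
# different (proportion, token-budget) settings.
r_obs = np.tile([0.05, 0.1, 0.2, 0.4, 0.8], 3)
n_obs = np.repeat([1e9, 4e9, 1.6e10], 5)
true_params = (20.0, 0.3, 0.08, 1.2)
loss_obs = loss_surface((r_obs, n_obs), *true_params)
loss_obs += rng.normal(scale=0.01, size=loss_obs.shape)

# Fit the four coefficients from the observations.
fit, _ = curve_fit(loss_surface, (r_obs, n_obs), loss_obs,
                   p0=(10.0, 0.5, 0.1, 1.0), maxfev=20000)
print("fitted C, alpha, beta, E:", np.round(fit, 3))

# The fitted law then predicts losses for unseen mixtures/budgets
# without additional training runs.
print("predicted loss at r=0.3, n=3.2e10:",
      loss_surface((0.3, 3.2e10), *fit))
```

Once such a surface is fitted per domain, candidate mixtures can be ranked by predicted loss before committing to any further pretraining, which is the efficiency argument the paper makes.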
The paper also discusses the practical implications of BiMix, such as mixture selection and proportion optimization, and highlights its potential for efficient and robust optimization of data mixtures in language model pretraining. Overall, the research advances the field by providing a systematic and scalable approach to optimizing data mixtures, leading to more efficient and effective training.
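As a companion to the fitting sketch above, the following hedged example shows what "optimizing mixing proportions under customized constraints" could look like: per-domain coefficients of the illustrative law are treated as given, and proportions are chosen on the probability simplex to minimize an assumed objective (the mean predicted loss across domains). The coefficients, domain names, objective, and constraints are illustrative assumptions, not the paper's reported setup.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical per-domain coefficients (C, alpha, beta, E) of the illustrative
# law from the previous sketch; real values would come from fitting.
coeffs = {
    "web":  (12.0, 0.35, 0.12, 1.1),
    "code": (8.0,  0.25, 0.10, 0.9),
    "wiki": (6.0,  0.30, 0.15, 1.0),
}
domains = list(coeffs)
N_TOKENS = 1e10  # fixed total training budget (illustrative)

def predicted_loss(r, C, alpha, beta, E, n=N_TOKENS):
    return C / (r**alpha * n**beta) + E

def objective(r_vec):
    # Assumed aggregate objective: average predicted validation loss.
    return np.mean([predicted_loss(r, *coeffs[d])
                    for r, d in zip(r_vec, domains)])

# Proportions must be non-negative and sum to 1 (simplex constraint);
# further custom constraints (e.g., a minimum share of code data) can be added.
constraints = [{"type": "eq", "fun": lambda r: np.sum(r) - 1.0}]
bounds = [(1e-3, 1.0)] * len(domains)
x0 = np.full(len(domains), 1.0 / len(domains))

result = minimize(objective, x0, method="SLSQP",
                  bounds=bounds, constraints=constraints)
print(dict(zip(domains, np.round(result.x, 3))))
```

The same pattern extends to budget-aware or domain-priority constraints by adding entries to `constraints`, which is the sense in which the proportion optimization is "customizable".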