1 Jul 2024 | Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, Min Lin
REGMIX is a method for automatically selecting a high-performing data mixture for pre-training large language models (LLMs). It formulates data mixture selection as a regression task: many small models are trained on diverse data mixtures, a regression model is fit to predict performance from mixture weights, and the fitted model is then used to rank unseen mixtures, so the best candidate can be identified without exhaustively training on each one. To validate the approach, the authors train 512 small models with 1M parameters on 1B tokens each, then use the fitted regressor to predict the best mixture for a larger model with 1B parameters trained on 25B tokens. The resulting mixture outperforms human selection and matches or surpasses the DoReMi method while using only 10% of its compute budget.

The study also yields several empirical findings. Data mixtures significantly impact performance, with single-task performance varying by up to 14.6%. Web corpora, rather than conventionally "high-quality" data such as Wikipedia, show the strongest positive correlation with downstream performance. Domain interactions are complex and often contradict common sense, underscoring the need for automated approaches like REGMIX. Moreover, data mixture effects transcend scaling laws, and REGMIX captures this complexity by modeling all domains jointly. The method is efficient and scalable, and the insights it provides into domain interactions make it a valuable tool for LLM pre-training.
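To make the regression step concrete, below is a minimal sketch in Python. It is not the authors' implementation: the domain list, the synthetic train_proxy_and_eval helper, and the LightGBM hyperparameters are illustrative assumptions, though the paper does use LightGBM (alongside linear regression) as a regressor.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
# Illustrative domain names, not the paper's exact training domains.
domains = ["common_crawl", "github", "wikipedia", "books", "arxiv"]

def sample_mixtures(n, k):
    # Dirichlet sampling yields valid mixture weights: non-negative, summing to 1.
    return rng.dirichlet(np.ones(k), size=n)

def train_proxy_and_eval(weights):
    # Hypothetical stand-in for training a ~1M-parameter proxy model on ~1B
    # tokens drawn according to `weights` and measuring validation loss; the
    # synthetic coefficients below are made up so the sketch runs end to end.
    effect = np.array([-0.8, -0.1, 0.2, 0.1, -0.2])
    return 3.0 + weights @ effect + rng.normal(scale=0.01)

# 1. Train small proxy models on diverse mixtures (512 in the paper).
X = sample_mixtures(512, len(domains))
y = np.array([train_proxy_and_eval(w) for w in X])

# 2. Fit a regressor mapping mixture weights -> loss
#    (hyperparameters here are illustrative).
reg = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
reg.fit(X, y)

# 3. Score a large pool of unseen candidate mixtures with the cheap regressor
#    and keep the one with the lowest predicted loss -- no further training.
candidates = sample_mixtures(100_000, len(domains))
best = candidates[np.argmin(reg.predict(candidates))]
print(dict(zip(domains, np.round(best, 4))))
```

The compute savings come from step 3: once the regressor is fit, each candidate mixture costs only a prediction to evaluate, so a very large candidate pool can be searched without training any additional models.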