1 Jul 2024 | Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, Min Lin
REGMIX is a method for automatically selecting a high-performing data mixture for pre-training large language models (LLMs). It formulates data mixture selection as a regression task: many small models are trained on diverse data mixtures, a regression model is fit to predict performance from mixture weights, and the fitted model is then used to rank unseen mixtures, so the best candidate can be identified without exhaustively training on each one. To validate the approach, the authors train 512 small models with 1M parameters on 1B tokens each, then use the fitted regressor to predict the best mixture for a larger model with 1B parameters trained on 25B tokens. The resulting mixture outperforms human selection and matches or surpasses the DoReMi method while using only 10% of its compute budget.

The study also yields several empirical findings. Data mixtures significantly impact performance, with single-task performance varying by up to 14.6%. Web corpora, rather than conventionally "high-quality" data such as Wikipedia, show the strongest positive correlation with downstream performance. Domain interactions are complex and often contradict common sense, underscoring the need for automated approaches like REGMIX. Moreover, data mixture effects transcend scaling laws, and REGMIX captures this complexity by modeling all domains jointly. The method is efficient and scalable, and the insights it provides into domain interactions make it a valuable tool for LLM pre-training.
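To make the regression step concrete, below is a minimal sketch in Python. It is not the authors' implementation: the domain list, the synthetic train_proxy_and_eval helper, and the LightGBM hyperparameters are illustrative assumptions, though the paper does use LightGBM (alongside linear regression) as a regressor.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
# Illustrative domain names, not the paper's exact training domains.
domains = ["common_crawl", "github", "wikipedia", "books", "arxiv"]

def sample_mixtures(n, k):
    # Dirichlet sampling yields valid mixture weights: non-negative, summing to 1.
    return rng.dirichlet(np.ones(k), size=n)

def train_proxy_and_eval(weights):
    # Hypothetical stand-in for training a ~1M-parameter proxy model on ~1B
    # tokens drawn according to `weights` and measuring validation loss; the
    # synthetic coefficients below are made up so the sketch runs end to end.
    effect = np.array([-0.8, -0.1, 0.2, 0.1, -0.2])
    return 3.0 + weights @ effect + rng.normal(scale=0.01)

# 1. Train small proxy models on diverse mixtures (512 in the paper).
X = sample_mixtures(512, len(domains))
y = np.array([train_proxy_and_eval(w) for w in X])

# 2. Fit a regressor mapping mixture weights -> loss
#    (hyperparameters here are illustrative).
reg = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
reg.fit(X, y)

# 3. Score a large pool of unseen candidate mixtures with the cheap regressor
#    and keep the one with the lowest predicted loss -- no further training.
candidates = sample_mixtures(100_000, len(domains))
best = candidates[np.argmin(reg.predict(candidates))]
print(dict(zip(domains, np.round(best, 4))))
```

The compute savings come from step 3: once the regressor is fit, each candidate mixture costs only a prediction to evaluate, so a very large candidate pool can be searched without training any additional models.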