Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM

March 13, 2024 | Sainbayar Sukhbaatar, Olga Golovneva, Vasu Sharma, Hu Xu, Xi Victoria Lin, Baptiste Rozière, Jacob Kahn, Daniel Li, Wen-tau Yih, Jason Weston, Xian Li
Branch-Train-MiX (BTX) is a method for training large language models (LLMs) to excel in multiple specialized domains. Starting from a seed model, training is branched: copies of the model are trained in parallel as domain experts on different data (for example math, code, and world knowledge). The experts are then combined into a single model with a Mixture-of-Experts (MoE) architecture, in which a learned router selects the most relevant experts for each token, and the unified network is finetuned so that this token-level routing is learned.

BTX combines the strengths of two prior approaches. Like Branch-Train-Merge (BTM), it trains experts efficiently and asynchronously; like standard MoE training, it produces a single unified neural network that can be further finetuned. Evaluated on tasks covering math, code, and world knowledge, BTX achieves the best accuracy-versus-compute tradeoff among the compared methods and is more robust to task interference, making it a promising recipe for continued pretraining. The paper also studies variations of the method, including different routing strategies and ways of blending experts, and highlights the benefit of MoE finetuning for learning token-level routing. Overall, BTX offers a more effective and efficient way to train LLMs for multiple domains.
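To make the core construction concrete, below is a minimal PyTorch-style sketch (not the authors' code) of assembling one MoE sublayer from the feed-forward (FFN) sublayers of several separately trained domain experts, with a top-k token router that is learned during the subsequent finetuning stage. The class names (`FeedForward`, `MixedFFN`), dimensions, and top-k value are illustrative assumptions; merging of the experts' remaining non-FFN parameters is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeedForward(nn.Module):
    """A standard transformer FFN sublayer (one per expert after branching)."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_hidden)
        self.w_out = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(F.relu(self.w_in(x)))


class MixedFFN(nn.Module):
    """MoE sublayer assembled from the FFNs of separately trained domain experts.

    The experts' weights are copied in unchanged; the router is initialized
    fresh and learns token-level routing during the finetuning stage.
    """

    def __init__(self, expert_ffns: list, d_model: int, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(expert_ffns)
        self.router = nn.Linear(d_model, len(expert_ffns), bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        logits = self.router(x)                           # (batch, seq, n_experts)
        weights, chosen = logits.topk(self.top_k, dim=-1)  # pick top-k experts per token
        weights = weights.softmax(dim=-1)                  # mix only the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[..., slot] == e              # tokens routed to expert e
                if mask.any():
                    out[mask] = out[mask] + (
                        weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
                    )
        return out


# Hypothetical usage: build one MoE sublayer from three domain experts
# (e.g. math, code, world knowledge). In BTX the FFNs would come from the
# branched expert models rather than being freshly initialized.
d_model, d_hidden = 512, 2048
experts = [FeedForward(d_model, d_hidden) for _ in range(3)]
moe_layer = MixedFFN(experts, d_model=d_model, top_k=2)
tokens = torch.randn(4, 16, d_model)
print(moe_layer(tokens).shape)  # torch.Size([4, 16, 512])
```

Because each expert's FFN weights are reused unchanged, the experts can be trained fully asynchronously; only the router, and the subsequent finetuning of the unified model, requires bringing them together.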