OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models

27 Mar 2024 | Fuzhao Xue, Zian Zheng, Yao Fu, Jinjie Ni, Zangwei Zheng, Wangchunshu Zhou, Yang You
OpenMoE is an open-source series of decoder-only Mixture-of-Experts (MoE) language models, ranging from 650M to 34B parameters and trained on up to 1T tokens. The study investigates the effectiveness of MoE-based LLMs and finds that they offer a better cost-effectiveness trade-off than dense models. Its key findings are Context-Independent Specialization, Early Routing Learning, and Drop-towards-the-End: MoE models route tokens primarily based on token IDs rather than context, these routing decisions are largely fixed early in pre-training, and the resulting fixed assignments can degrade performance on sequential tasks such as multi-turn conversations. The study also proposes strategies to mitigate these issues and to improve future MoE LLM designs.

The paper pursues three main goals: (1) training a decoder-only MoE model within existing LLM frameworks, (2) analyzing MoE routing mechanisms, and (3) informing future MoE LLM development. The released models, OpenMoE-Base/16E, OpenMoE-8B/32E, and OpenMoE-34B/32E, come with detailed configurations and achieve performance comparable to dense open-source LLMs, with OpenMoE-8B/32E-Chat outperforming dense models on single-turn conversations. The study also explores advanced training strategies, including a large proportion of code data and the UL2 training objective.

Analysis of routing behavior shows that experts specialize on token IDs rather than on context. Because each expert can process only a fixed number of tokens (its capacity), this leads to Drop-towards-the-End: tokens appearing later in a sequence are more likely to be dropped once the preferred experts are full.
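To make Drop-towards-the-End concrete, the sketch below shows top-1 routing with a fixed expert capacity: tokens earlier in the sequence claim expert slots first, so once a popular expert fills up, it is the later tokens that get dropped. This is a minimal illustration, not the OpenMoE code; the function name `route_top1_with_capacity`, the capacity formula, and the use of PyTorch are assumptions made for the example.

```python
# Illustrative top-1 routing with a fixed expert capacity (not the OpenMoE implementation).
import torch

def route_top1_with_capacity(router_logits: torch.Tensor,
                             num_experts: int,
                             capacity_factor: float = 1.25):
    """router_logits: [num_tokens, num_experts] scores for one sequence, in order.
    Returns each token's chosen expert and whether it was kept or dropped."""
    num_tokens = router_logits.shape[0]
    # Each expert holds roughly its fair share of tokens, times a slack factor.
    capacity = int(capacity_factor * num_tokens / num_experts)
    expert_choice = router_logits.argmax(dim=-1)        # top-1 expert per token
    load = torch.zeros(num_experts, dtype=torch.long)   # slots used per expert
    kept = torch.zeros(num_tokens, dtype=torch.bool)
    for t in range(num_tokens):                         # earlier tokens claim slots first
        e = expert_choice[t].item()
        if load[e] < capacity:
            load[e] += 1
            kept[t] = True                              # processed by its expert
        # else: the token is dropped and only the residual path carries it,
        # which is why tokens late in the sequence are dropped most often.
    return expert_choice, kept

# Example: skewed routing fills a popular expert early, so later tokens get dropped.
torch.manual_seed(0)
logits = torch.randn(32, 4)
logits[:, 0] += 2.0                                     # make expert 0 over-subscribed
choice, kept = route_top1_with_capacity(logits, num_experts=4)
print("drop rate, first half :", (~kept[:16]).float().mean().item())
print("drop rate, second half:", (~kept[16:]).float().mean().item())
```

The Python loop is for clarity only; production MoE layers typically build the same capacity mask with a vectorized cumulative sum over the expert assignments.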
The paper also examines other MoE models, such as Mixtral and DeepSeek-MoE, and finds similar issues. For future MoE models, the study suggests a more balanced code-data ratio, better tokenizers, and improved MoE architectures, and it proposes mixing instruction-following data into pre-training to alleviate the Drop-towards-the-End issue. It concludes that MoE models offer significant benefits to the open-source community, but further research is needed to address their current limitations.