OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models

27 Mar 2024 | Fuzhao Xue, Zian Zheng, Yao Fu, Jinjie Ni, Zangwei Zheng, Wangchunshu Zhou, Yang You
OpenMoE is an open-source series of decoder-only Mixture-of-Experts (MoE) language models, ranging from 650M to 34B parameters and trained on up to 1T tokens. The study investigates the effectiveness of MoE-based LLMs and finds that they offer a better cost-effectiveness trade-off than dense models. Its key findings are Context-Independent Specialization, Early Routing Learning, and Drop-towards-the-End: MoE models route tokens primarily based on token IDs rather than context, these routing decisions are largely fixed early in pre-training, and the resulting fixed assignments can degrade performance on sequential tasks such as multi-turn conversations. The study also proposes strategies to mitigate these issues and to improve future MoE LLM designs.

The paper pursues three main goals: (1) training a decoder-only MoE model within existing LLM frameworks, (2) analyzing MoE routing mechanisms, and (3) informing future MoE LLM development. The released models, OpenMoE-Base/16E, OpenMoE-8B/32E, and OpenMoE-34B/32E, come with detailed configurations and achieve performance comparable to dense open-source LLMs, with OpenMoE-8B/32E-Chat outperforming dense models on single-turn conversations. The study also explores advanced training strategies, including a large proportion of code data and the UL2 training objective.

Analysis of routing behavior shows that experts specialize on token IDs rather than on context. Because each expert can process only a fixed number of tokens (its capacity), this leads to Drop-towards-the-End: tokens appearing later in a sequence are more likely to be dropped once the preferred experts are full.
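To make Drop-towards-the-End concrete, the sketch below shows top-1 routing with a fixed expert capacity: tokens earlier in the sequence claim expert slots first, so once a popular expert fills up, it is the later tokens that get dropped. This is a minimal illustration, not the OpenMoE code; the function name `route_top1_with_capacity`, the capacity formula, and the use of PyTorch are assumptions made for the example.

```python
# Illustrative top-1 routing with a fixed expert capacity (not the OpenMoE implementation).
import torch

def route_top1_with_capacity(router_logits: torch.Tensor,
                             num_experts: int,
                             capacity_factor: float = 1.25):
    """router_logits: [num_tokens, num_experts] scores for one sequence, in order.
    Returns each token's chosen expert and whether it was kept or dropped."""
    num_tokens = router_logits.shape[0]
    # Each expert holds roughly its fair share of tokens, times a slack factor.
    capacity = int(capacity_factor * num_tokens / num_experts)
    expert_choice = router_logits.argmax(dim=-1)        # top-1 expert per token
    load = torch.zeros(num_experts, dtype=torch.long)   # slots used per expert
    kept = torch.zeros(num_tokens, dtype=torch.bool)
    for t in range(num_tokens):                         # earlier tokens claim slots first
        e = expert_choice[t].item()
        if load[e] < capacity:
            load[e] += 1
            kept[t] = True                              # processed by its expert
        # else: the token is dropped and only the residual path carries it,
        # which is why tokens late in the sequence are dropped most often.
    return expert_choice, kept

# Example: skewed routing fills a popular expert early, so later tokens get dropped.
torch.manual_seed(0)
logits = torch.randn(32, 4)
logits[:, 0] += 2.0                                     # make expert 0 over-subscribed
choice, kept = route_top1_with_capacity(logits, num_experts=4)
print("drop rate, first half :", (~kept[:16]).float().mean().item())
print("drop rate, second half:", (~kept[16:]).float().mean().item())
```

The Python loop is for clarity only; production MoE layers typically build the same capacity mask with a vectorized cumulative sum over the expert assignments.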
The paper also examines other MoE models, such as Mixtral and DeepSeek-MoE, and finds similar issues. For future MoE models, the study suggests a more balanced code-data ratio, better tokenizers, and improved MoE architectures, and it proposes mixing instruction-following data into pre-training to alleviate the Drop-towards-the-End issue. It concludes that MoE models offer significant benefits to the open-source community, but further research is needed to address their current limitations.