Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models

3 Jun 2024 | Tianwen Wei, Bo Zhu, Liang Zhao, Cheng Cheng, Biye Li, Weiwei Lü, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Liang Zeng, Xiaokun Wang, Yutuan Ma, Rui Hu, Shuicheng Yan, Han Fang, Yahui Zhou
This technical report introduces Skywork-MoE, a high-performance mixture-of-experts (MoE) large language model (LLM) with 146 billion parameters and 16 experts. The model is initialized from the dense checkpoints of the Skywork-13B model, exploring the effectiveness of upcycling versus training from scratch. Key findings suggest that the choice between these approaches should consider both the performance of existing dense checkpoints and the MoE training budget. Two innovative techniques are highlighted: gating logit normalization, which improves expert diversification, and adaptive auxiliary loss coefficients, which allow for layer-specific adjustment of auxiliary loss coefficients. Experimental results validate the effectiveness of these methods, demonstrating strong performance across various benchmarks. The training of Skywork-MoE was conducted on a condensed subset of the SkyPile corpus, showcasing its robustness and efficiency in large-scale language processing tasks.
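To make the gating logit normalization idea concrete, below is a minimal PyTorch sketch of a top-k router that standardizes each token's expert logits (zero mean, unit variance) and rescales them before the softmax, which sharpens the routing distribution and encourages expert diversification. This is an illustrative reconstruction based on the abstract, not the paper's actual code; the function name, the `scale` factor, and the top-k value are assumptions.

import torch
import torch.nn.functional as F

def normalized_gating(logits: torch.Tensor, scale: float = 1.0,
                      top_k: int = 2, eps: float = 1e-6):
    """Sketch of gating with logit normalization.

    logits: [num_tokens, num_experts] raw router outputs.
    scale:  hypothetical temperature-like factor controlling how sharp
            the resulting expert distribution is.
    """
    # Standardize per token so routing is driven by relative preferences
    # rather than the raw magnitude of any single expert's logit.
    mean = logits.mean(dim=-1, keepdim=True)
    std = logits.std(dim=-1, keepdim=True)
    normed = scale * (logits - mean) / (std + eps)

    probs = F.softmax(normed, dim=-1)           # expert distribution per token
    top_p, top_idx = probs.topk(top_k, dim=-1)  # route each token to its top-k experts
    return top_p, top_idx, probs

The same `probs` tensor would typically feed a load-balancing auxiliary loss; the report's second technique, adaptive auxiliary loss coefficients, adjusts that loss's coefficient per layer depending on how balanced each layer's routing currently is, a schedule not shown in this sketch.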