The paper "LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training" by Tong Zhu et al. explores the construction of Mixture-of-Experts (MoE) models from existing dense large language models (LLMs), specifically the LLaMA-2 7B model. The authors aim to reduce the computational costs associated with training MoE models from scratch while maintaining or improving language capabilities. They propose two main steps: *Expert Construction* and *Continual Pre-training*.
1. **Expert Construction**: The feed-forward networks (FFNs) of the LLaMA model are partitioned into multiple experts. The authors compare neuron-independent and neuron-sharing partitioning schemes: the neuron-independent methods split each FFN's intermediate neurons into equal-sized, non-overlapping sets, either at random or via clustering, while the neuron-sharing methods rank intermediate neurons by their contribution to the model's performance and allow some neurons to be shared across experts. A minimal sketch of the random, neuron-independent split is shown after this list.
2. **Continual Pre-training**: After constructing the MoE model, with newly added gate networks to route tokens among the experts, the authors continually pre-train it to recover its language modeling ability. They investigate various data sampling strategies and data quality filtering methods to improve training efficiency and final performance. The training objective is the same as that of the original LLaMA model, and the training budget is 200B tokens. A minimal sketch of such a gated MoE layer also follows this list.
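To make the expert construction step concrete, here is a minimal PyTorch sketch of the neuron-independent random split: the intermediate neurons of a LLaMA-style SwiGLU FFN are shuffled and partitioned into equal-sized, disjoint expert slices. The names `SwiGLUExpert` and `split_ffn_into_experts` are illustrative placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One expert: a slice of the original LLaMA FFN's (SwiGLU) intermediate neurons."""
    def __init__(self, gate_w, up_w, down_w):
        super().__init__()
        # gate_w, up_w: [d_expert, d_model]; down_w: [d_model, d_expert]
        self.gate = nn.Parameter(gate_w)
        self.up = nn.Parameter(up_w)
        self.down = nn.Parameter(down_w)

    def forward(self, x):
        # x: [..., d_model] -> [..., d_model]
        h = F.silu(x @ self.gate.T) * (x @ self.up.T)
        return h @ self.down.T

def split_ffn_into_experts(gate_proj, up_proj, down_proj, num_experts, seed=0):
    """Neuron-independent random split: shuffle the intermediate neurons and
    partition them into equal-sized, disjoint expert slices."""
    d_ff = gate_proj.weight.shape[0]  # intermediate (hidden) size of the FFN
    assert d_ff % num_experts == 0
    gen = torch.Generator().manual_seed(seed)
    perm = torch.randperm(d_ff, generator=gen)
    experts = []
    for idx in perm.chunk(num_experts):
        experts.append(SwiGLUExpert(
            gate_proj.weight[idx].detach().clone(),     # rows of the gate projection
            up_proj.weight[idx].detach().clone(),       # matching rows of the up projection
            down_proj.weight[:, idx].detach().clone(),  # matching columns of the down projection
        ))
    return nn.ModuleList(experts)
```

The clustering and neuron-sharing variants would replace the random permutation with cluster assignments or importance scores, but the slicing of gate/up rows and down columns stays the same.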
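And here is a hedged sketch of a gated MoE layer of the kind trained in the continual pre-training step: a newly initialized linear gate scores the experts per token, the top-k experts are activated, and their outputs are combined with renormalized gate weights. The paper's exact router configuration and any auxiliary balancing losses may differ; `MoEFFN` is an assumed name.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFFN(nn.Module):
    """Minimal top-k gated MoE FFN built from a list of experts
    (e.g. the output of split_ffn_into_experts above)."""
    def __init__(self, d_model, experts, top_k=2):
        super().__init__()
        self.experts = experts
        self.gate = nn.Linear(d_model, len(experts), bias=False)  # newly initialized router
        self.top_k = top_k

    def forward(self, x):
        # x: [num_tokens, d_model] (flatten batch and sequence dims beforehand)
        scores = F.softmax(self.gate(x), dim=-1)                 # [tokens, num_experts]
        weights, idx = scores.topk(self.top_k, dim=-1)           # top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)    # renormalize gate weights
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)  # tokens routed to expert e
            if token_ids.numel() == 0:
                continue
            contrib = weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
            out.index_add_(0, token_ids, contrib)                # accumulate weighted outputs
        return out
```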
The paper evaluates the resulting LLaMA-MoE models on a range of downstream tasks, showing that the LLaMA-MoE-3.5B models significantly outperform dense models with a similar number of activated parameters. The authors also conduct ablation studies on the different expert construction methods and data sampling strategies, finding that optimized static data sampling weights and data quality filtering substantially improve performance.
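As a small illustration of what static data sampling weights mean in practice, the snippet below draws training documents from source domains with fixed probabilities rather than uniformly. The domain names and weights are purely illustrative placeholders, not the values used in the paper.

```python
import random

# Illustrative (not the paper's) fixed sampling weights over data domains.
domain_weights = {
    "web": 0.60,
    "code": 0.10,
    "books": 0.10,
    "wiki": 0.05,
    "arxiv": 0.05,
    "other": 0.10,
}

def sample_domain(rng=random):
    """Pick the domain to draw the next training document from,
    proportional to its static weight."""
    domains, weights = zip(*domain_weights.items())
    return rng.choices(domains, weights=weights, k=1)[0]
```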
Overall, the paper contributes to the field by providing a framework for building MoE models from existing dense LLMs, reducing training costs while maintaining or improving language capabilities. The source code and models are available on GitHub.