The paper introduces DeepSeekMoE, an innovative Mixture-of-Experts (MoE) architecture designed to achieve ultimate expert specialization. The architecture addresses the limitations of conventional MoE architectures, such as GShard, through two main strategies: fine-grained expert segmentation and shared expert isolation. Fine-grained expert segmentation splits each Feed-Forward Network (FFN) into multiple smaller experts, allowing more flexible combinations of activated experts and encouraging knowledge specialization. Shared expert isolation reserves a subset of experts as always-activated shared experts that capture common knowledge, reducing redundancy among the routed experts. The authors demonstrate that DeepSeekMoE 2B achieves performance comparable to GShard 2.9B with significantly fewer expert parameters and less computation, and nearly matches the performance of a dense model with the same total number of parameters. They further scale DeepSeekMoE to 16B parameters, showing that it matches or exceeds LLaMA2 7B with only about 40% of the computation. Preliminary efforts to scale DeepSeekMoE to 145B parameters confirm its substantial advantages over GShard and show performance comparable to DeepSeek 67B with only 28.5% of the computation. The paper includes detailed experimental results, ablation studies, and analyses of expert specialization, providing strong evidence for the effectiveness of the proposed architecture.
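To make the two strategies concrete, the sketch below shows a minimal DeepSeekMoE-style layer in PyTorch. This is an illustration rather than the authors' implementation: the class names, expert counts, hidden sizes, and routing details (softmax affinities followed by top-k selection, plus a simple residual connection) are assumptions chosen only to convey the structure of shared experts plus many small routed experts.

```python
# Minimal sketch of a DeepSeekMoE-style layer (illustrative, not the paper's code):
# a few shared experts are always applied to every token, while many small
# "fine-grained" experts are selected per token by a top-k router.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A small FFN expert, i.e. a thin slice of a conventional FFN."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_hidden)
        self.w_out = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.w_out(F.gelu(self.w_in(x)))


class DeepSeekMoELayer(nn.Module):
    """Shared experts (always active) + fine-grained routed experts (top-k)."""
    def __init__(self, d_model=512, d_hidden=256,
                 n_shared=2, n_routed=62, top_k=6):  # counts are illustrative
        super().__init__()
        self.shared = nn.ModuleList(Expert(d_model, d_hidden) for _ in range(n_shared))
        self.routed = nn.ModuleList(Expert(d_model, d_hidden) for _ in range(n_routed))
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, d_model)
        # Shared experts see every token and capture common knowledge.
        out = sum(e(x) for e in self.shared)

        # Router scores all routed experts, then keeps the top-k per token.
        scores = F.softmax(self.router(x), dim=-1)    # (tokens, n_routed)
        gate, idx = scores.topk(self.top_k, dim=-1)   # (tokens, top_k)

        # Dispatch each token to its selected experts, weighted by the gate.
        for k in range(self.top_k):
            for e_id in idx[:, k].unique():
                mask = idx[:, k] == e_id
                out[mask] += gate[mask, k].unsqueeze(-1) * self.routed[int(e_id)](x[mask])
        return x + out  # residual connection


if __name__ == "__main__":
    layer = DeepSeekMoELayer()
    tokens = torch.randn(8, 512)
    print(layer(tokens).shape)  # torch.Size([8, 512])
```

The point of the sketch is the division of labor the paper argues for: because each routed expert is small, the router can compose many of them per token, while the shared experts absorb knowledge common to all tokens so the routed experts are freer to specialize.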