Skywork-MoE is a high-performance mixture-of-experts (MoE) large language model with 146 billion parameters and 16 experts, initialized from the pre-trained dense Skywork-13B model. The model leverages two techniques, gating logit normalization and adaptive auxiliary loss coefficients, to enhance expert diversification and optimize training efficiency: gating logit normalization sharpens the distribution of expert probabilities, while adaptive auxiliary loss coefficients allow the auxiliary loss coefficient to be adjusted separately for each layer. The study compares upcycling from dense model checkpoints with training from scratch, examining the impact of training budgets and learning rate schedules, and finds that the better choice depends on the performance of the available dense model and the training budget, with upcycling being effective when the budget is moderate. Trained on a subset of the SkyPile corpus, the model demonstrates strong performance across various benchmarks and outperforms several open-source models on tasks such as Chinese language understanding, mathematical reasoning, and code generation. These results highlight the importance of expert diversification and efficient training techniques in the development of large-scale MoE models.
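
As a minimal sketch of what gating logit normalization could look like in PyTorch: the gating logits are standardized (zero mean, unit variance) per token and rescaled before the softmax, which sharpens the routing distribution. The function name, the scaling factor `scale`, and the top-k routing details below are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def normalized_gating(hidden_states, gate_weight, scale=1.0, top_k=2):
    """Gating with logit normalization (illustrative sketch).

    Standardizing the gating logits before the softmax yields a sharper
    expert distribution; `scale` controls the sharpness and is an assumed
    hyperparameter here, not a value taken from the paper.
    """
    # Raw gating logits: one score per expert for each token.
    logits = hidden_states @ gate_weight              # (tokens, num_experts)

    # Standardize each token's logits, then rescale before the softmax --
    # the core of the normalization step.
    mean = logits.mean(dim=-1, keepdim=True)
    std = logits.std(dim=-1, keepdim=True)
    probs = F.softmax(scale * (logits - mean) / (std + 1e-6), dim=-1)

    # Standard top-k routing on the normalized probabilities.
    topk_probs, topk_experts = probs.topk(top_k, dim=-1)
    return topk_probs, topk_experts

# Example usage with made-up dimensions (hidden size 4096, 16 experts):
h = torch.randn(8, 4096)
W = torch.randn(4096, 16)
weights, experts = normalized_gating(h, W, scale=1.0, top_k=2)
```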