LLaMA-MoE is a family of mixture-of-experts (MoE) models built from the dense LLaMA-2 7B model through expert construction and continual pre-training. Each Feed-Forward Network (FFN) in the dense model is partitioned into multiple experts, and the resulting MoE model is trained together with newly added gate networks that route each input token to a subset of experts. The study compares several expert construction methods, grouped into neuron-independent and neuron-sharing approaches, along with different data sampling strategies for continual pre-training.

The resulting LLaMA-MoE-3.5B models, continually pre-trained on 200B tokens, significantly outperform dense models with a similar number of activated parameters while retaining the language abilities of the original LLaMA-2 7B model. Across a range of downstream tasks, LLaMA-MoE outperforms comparable open pre-trained language models on most benchmarks, and its construction procedure and training data are fully transparent. These results indicate that converting existing dense models into MoE models is a viable way to improve performance while reducing the computational cost per token.
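
To make the neuron-independent construction and token routing more concrete, the sketch below shows one way a dense SwiGLU FFN could be split into smaller experts behind a top-k gate. This is an illustrative PyTorch sketch under stated assumptions, not the authors' implementation: the class names, the even neuron split, and the plain top-k softmax router are assumptions, and the step of initializing each expert by slicing the pretrained FFN weights is only indicated in comments.

```python
# Illustrative sketch (not the LLaMA-MoE code): partition the intermediate
# neurons of a dense SwiGLU FFN into disjoint groups, each becoming a smaller
# expert, and add a freshly initialized gate that routes tokens to top-k experts.
# Sizes follow LLaMA-2 7B (hidden 4096, intermediate 11008); names are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUExpert(nn.Module):
    """One expert holding a disjoint slice of the original FFN's neurons.

    In practice gate_proj/up_proj rows and down_proj columns would be copied
    from the corresponding slice of the pretrained dense FFN weights.
    """

    def __init__(self, hidden_size, expert_inter_size):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, expert_inter_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, expert_inter_size, bias=False)
        self.down_proj = nn.Linear(expert_inter_size, hidden_size, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


class MoEFFN(nn.Module):
    """Top-k routed MoE layer built by splitting a dense FFN's neurons."""

    def __init__(self, hidden_size=4096, inter_size=11008, num_experts=8, top_k=2):
        super().__init__()
        assert inter_size % num_experts == 0  # assumes an even neuron split
        self.top_k = top_k
        # The router (gate network) is new and trained during continual pre-training.
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [SwiGLUExpert(hidden_size, inter_size // num_experts)
             for _ in range(num_experts)]
        )

    def forward(self, x):  # x: (num_tokens, hidden_size)
        weights, idx = torch.topk(
            F.softmax(self.router(x), dim=-1), self.top_k, dim=-1
        )
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize top-k
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out
```

With 8 experts and top-2 routing over an 11008-wide intermediate layer, each token activates 2 × 1376 intermediate neurons instead of all 11008, which is the source of the reduced per-token compute; an auxiliary load-balancing loss on the router (omitted here) is typically needed to keep expert usage even.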