JetMoE-8B is an open-source Mixture-of-Experts (MoE) model trained for less than $0.1 million, yet it achieves performance comparable to or better than larger models such as Llama2-7B and Llama2-13B-Chat. It was trained on 1.25T tokens of mixed open-source data using 30,000 H100 GPU hours. The model applies sparse activation in both the attention and feed-forward layers, reducing inference computation by about 70% compared to Llama2-7B. JetMoE-8B is designed to be open and academia-friendly, relying only on public datasets and publicly released training code, and it demonstrates that LLM training can be significantly more cost-effective than previously thought. The model is publicly available at https://github.com/myshell-ai/JetMoE.

The architecture of JetMoE-8B is based on an efficient Sparsely-gated Mixture-of-Experts (SMoE) design, which allows for efficient parameter usage and reduces computational cost. The model is trained on a combination of real-world and synthetic datasets, including RefinedWeb, StarCoder, The Pile, and others, and is further fine-tuned with distilled supervised fine-tuning (dSFT) and distilled direct preference optimization (dDPO) to enhance performance. JetMoE-8B outperforms several open-source models on benchmark tasks, including the OpenLLM Leaderboard and MT-Bench. Efficient, scalable, and accessible, it is a valuable contribution to the development of open-source, efficient, and high-performing language models.
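To make the sparse-activation idea concrete, the following is a minimal sketch of a sparsely-gated MoE feed-forward layer with top-k routing, written against PyTorch. It is illustrative only: the class name, expert count, top-k value, and hidden sizes are assumptions for the example and are not JetMoE-8B's exact configuration or implementation.

```python
# Minimal sketch of a sparsely-gated MoE feed-forward layer (top-k routing).
# Hyperparameters below are illustrative, not JetMoE-8B's actual settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEFeedForward(nn.Module):  # hypothetical class name
    def __init__(self, d_model=1024, d_hidden=2816, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Router scores each token against every expert.
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Each expert is an independent two-layer MLP.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (batch, seq, d_model)
        batch, seq, d_model = x.shape
        tokens = x.reshape(-1, d_model)                      # flatten to (num_tokens, d_model)
        logits = self.router(tokens)                         # (num_tokens, n_experts)
        top_vals, top_idx = logits.topk(self.top_k, dim=-1)  # keep top-k experts per token
        weights = F.softmax(top_vals, dim=-1)                # normalize over selected experts only

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = (top_idx == e)                            # which tokens routed to expert e
            token_ids, slot = mask.nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue                                     # expert receives no tokens this step
            expert_out = expert(tokens[token_ids])
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert_out
        return out.reshape(batch, seq, d_model)
```

The point of the sketch is that only the top-k experts selected by the router run for each token, so most expert parameters sit idle on any given forward pass; this per-token sparsity is the general mechanism behind the reduced inference computation described above.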