8 Jan 2024 | Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed
Mixtral 8x7B is a sparse mixture-of-experts (SMoE) language model developed by Mistral AI. It is based on the same architecture as Mistral 7B, with each layer containing 8 feedforward blocks (experts). At each layer, a router network selects two of these experts to process each token and combines their outputs. This gives Mixtral access to 47B parameters while using only 13B active parameters per token during inference. Mixtral was trained with a context size of 32k tokens and outperforms or matches Llama 2 70B and GPT-3.5 across a range of benchmarks, excelling in particular at mathematics, code generation, and multilingual tasks. A fine-tuned version, Mixtral 8x7B – Instruct, surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B – chat on human evaluation benchmarks and shows reduced bias and a more balanced sentiment profile. Both models are released under the Apache 2.0 license.
Mixtral is a sparse mixture-of-experts network: a decoder-only model in which each token is processed by two experts selected by a router at every layer. This increases the model's total parameter count while keeping the computational cost per token low. Mixtral is pretrained on multilingual data with a context size of 32k tokens and performs strongly across benchmarks, significantly outperforming Llama 2 70B in mathematics, code generation, and multilingual understanding. It can retrieve information from anywhere in its 32k-token context window, regardless of the length of the sequence or the position of the information within it.
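As an illustration of the routing described above, here is a minimal PyTorch-style sketch of a top-2 sparse MoE feed-forward layer. The class and variable names (SwiGLUExpert, Top2MoE, the per-slot loop) are illustrative assumptions, not Mistral AI's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One feed-forward expert: a SwiGLU MLP, as used in Mistral-style blocks."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # up projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class Top2MoE(nn.Module):
    """Sparse MoE layer: a router picks 2 of n experts per token and
    combines their outputs, weighted by the renormalized router scores."""
    def __init__(self, dim: int, hidden_dim: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            SwiGLUExpert(dim, hidden_dim) for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim)
        logits = self.gate(x)                                  # (tokens, n_experts)
        weights, indices = torch.topk(logits, self.k, dim=-1)  # top-2 per token
        weights = F.softmax(weights, dim=-1)                   # softmax over the kept logits
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                   # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```

The loop over experts is written for clarity; efficient implementations instead group tokens by expert so that each expert processes its assigned tokens in a single batched matrix multiplication.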
Mixtral 8x7B – Instruct is a chat model fine-tuned to follow instructions using supervised fine-tuning followed by Direct Preference Optimization (DPO). It outperforms GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B – chat on human evaluation benchmarks, while showing reduced bias and a more balanced sentiment profile.
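For context, DPO trains directly on preference pairs, pushing the policy to rank the preferred completion above the rejected one relative to a frozen reference model (typically the SFT checkpoint). Below is a minimal sketch of the standard DPO loss; the function name, input log-probabilities, and the beta value are illustrative assumptions, as the paper does not report its fine-tuning recipe in detail.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective (Rafailov et al., 2023).

    Each input is the summed log-probability of a full completion under
    either the trainable policy or the frozen reference model; beta=0.1
    is a commonly used value, not one reported for Mixtral.
    """
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```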
Mixtral's architecture follows the Mistral 7B transformer, with each layer's feed-forward block replaced by a Mixture-of-Experts layer. The model has a total (sparse) parameter count of 47B but activates only 13B parameters per token, which keeps inference cost and latency low. The MoE layers can be executed efficiently on a single GPU using block-sparse kernels, and the experts can be distributed across multiple GPUs for parallel processing (expert parallelism).
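The gap between the 47B total and 13B active parameters follows directly from the routing: attention, embeddings, and norms run for every token, while only 2 of the 8 expert MLPs in each layer do. A back-of-the-envelope version of this accounting is sketched below; the shared/expert split is an assumed figure for illustration, not one reported in the paper.

```python
# Rough parameter accounting for top-2-of-8 routing. Only the ~47B total
# is from the paper; the shared/expert split below is an ASSUMPTION.
n_experts, k_active = 8, 2
total_params = 47e9
shared_params = 1.3e9   # assumed: attention, embeddings, norms (always active)
expert_params = total_params - shared_params
active_params = shared_params + (k_active / n_experts) * expert_params
print(f"~{active_params / 1e9:.0f}B parameters active per token")  # prints ~13B
```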
Mixtral outperforms Llama 2 70B on most benchmarks while using 5x fewer active parameters. It performs well on multilingual benchmarks, outperforming Llama 2 70B in several languages. Mixtral also excels at long-context tasks, achieving 100% retrieval accuracy on the passkey task regardless of the context length or the position of the passkey in the sequence. On bias benchmarks, it displays less bias and more positive sentiment than Llama 2 70B.