Mixtral of Experts


8 Jan 2024 | Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed
Mixtral 8x7B is a Sparse Mixture of Experts (SMoE) language model introduced by researchers at Mistral AI. The model shares the architecture of Mistral 7B, except that each layer consists of 8 feedforward blocks (experts). For every token, a router network selects two of these experts to process it and combines their outputs. This approach gives Mixtral access to 47 billion parameters while using only 13 billion active parameters per token during inference. Trained with a context size of 32k tokens, Mixtral outperforms or matches Llama 2 70B and GPT-3.5 across a range of benchmarks, excelling in particular at mathematics, code generation, and multilingual tasks. The authors also release a fine-tuned version, Mixtral 8x7B – Instruct, which surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B chat on human evaluation benchmarks. Both the base and fine-tuned models are released under the Apache 2.0 license. The paper covers the architectural details, performance comparisons, and bias analysis, highlighting Mixtral's efficiency and effectiveness in handling long contexts and diverse tasks.
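To make the routing idea concrete, here is a minimal sketch of a top-2 sparse MoE layer in PyTorch. It is an illustrative re-implementation of the mechanism described above, not Mistral AI's code: the class and attribute names (`MoeLayer`, `gate`, `experts`) and the plain SiLU MLP experts are assumptions made for brevity, whereas the paper's experts are SwiGLU feedforward blocks.

```python
# Illustrative sketch of top-2 sparse MoE routing (not the official Mixtral code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoeLayer(nn.Module):
    def __init__(self, dim: int, hidden_dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        # One feedforward "expert" per slot; a plain MLP stands in for the paper's SwiGLU block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(dim, num_experts, bias=False)  # router network
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim). The router picks top_k experts per token; their outputs
        # are combined with softmax weights computed over the selected gating logits.
        logits = self.gate(x)                                   # (num_tokens, num_experts)
        weights, selected = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                    # normalize over the chosen experts
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            # Which tokens routed to expert i, and in which of their top_k slots.
            token_idx, slot_idx = torch.where(selected == i)
            if token_idx.numel() == 0:
                continue  # expert i receives no tokens in this batch
            out[token_idx] += weights[token_idx, slot_idx, None] * expert(x[token_idx])
        return out
```

With the paper's reported 8 experts and top-2 routing, only 2 of the 8 expert MLPs run for any given token, which is why the active parameter count (about 13B) is far smaller than the total parameter count (about 47B).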