Mixtral of Experts


8 Jan 2024 | Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed
Mixtral 8x7B is a Sparse Mixture of Experts (SMoE) language model introduced by researchers at Mistral AI. The model shares the architecture of Mistral 7B, except that each layer consists of 8 feedforward blocks (experts). For every token, a router network selects two of these experts to process it and combines their outputs. This approach gives Mixtral access to 47 billion parameters while using only 13 billion active parameters per token during inference. Trained with a context size of 32k tokens, Mixtral outperforms or matches Llama 2 70B and GPT-3.5 across a range of benchmarks, excelling in particular at mathematics, code generation, and multilingual tasks. The authors also release a fine-tuned version, Mixtral 8x7B – Instruct, which surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B chat on human evaluation benchmarks. Both the base and fine-tuned models are released under the Apache 2.0 license. The paper covers the architectural details, performance comparisons, and bias analysis, highlighting Mixtral's efficiency and effectiveness in handling long contexts and diverse tasks.
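To make the routing idea concrete, here is a minimal sketch of a top-2 sparse MoE layer in PyTorch. It is an illustrative re-implementation of the mechanism described above, not Mistral AI's code: the class and attribute names (`MoeLayer`, `gate`, `experts`) and the plain SiLU MLP experts are assumptions made for brevity, whereas the paper's experts are SwiGLU feedforward blocks.

```python
# Illustrative sketch of top-2 sparse MoE routing (not the official Mixtral code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoeLayer(nn.Module):
    def __init__(self, dim: int, hidden_dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        # One feedforward "expert" per slot; a plain MLP stands in for the paper's SwiGLU block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(dim, num_experts, bias=False)  # router network
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim). The router picks top_k experts per token; their outputs
        # are combined with softmax weights computed over the selected gating logits.
        logits = self.gate(x)                                   # (num_tokens, num_experts)
        weights, selected = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                    # normalize over the chosen experts
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            # Which tokens routed to expert i, and in which of their top_k slots.
            token_idx, slot_idx = torch.where(selected == i)
            if token_idx.numel() == 0:
                continue  # expert i receives no tokens in this batch
            out[token_idx] += weights[token_idx, slot_idx, None] * expert(x[token_idx])
        return out
```

With the paper's reported 8 experts and top-2 routing, only 2 of the 8 expert MLPs run for any given token, which is why the active parameter count (about 13B) is far smaller than the total parameter count (about 47B).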