3 Jul 2024 | Opher Lieber*, Barak Lenz*, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, Omri Abend, Raz Alon, Tomer Asida, Amir Bergman, Roman Glozman, Michael Gokhman, Avshalom Manevich, Nir Ratner, Noam Rozen, Erez Schwartz, Mor Zusman, Yoav Shoham
Jamba is a novel large language model that interleaves Transformer layers with Mamba layers (a recent state-space model) and augments some layers with a mixture-of-experts (MoE) component. This hybrid architecture aims to balance memory usage, throughput, and performance, particularly for long contexts. The model is designed to fit on a single 80GB GPU while maintaining state-of-the-art performance on standard benchmarks and supporting up to 256K tokens of context. Key features include:
1. **Hybrid Architecture**: Jamba interleaves Transformer and Mamba layers, leveraging the complementary strengths of both (a minimal layout sketch follows this list).
2. **Mixture-of-Experts (MoE)**: MoE is applied to some layers to increase model capacity while keeping active parameter usage manageable.
3. **Memory Efficiency**: Jamba reduces the KV cache memory requirements compared to vanilla Transformers, making it suitable for long contexts.
4. **Throughput**: Jamba achieves high throughput, especially for long sequences, with up to a 3x improvement over Mixtral-8x7B at long context lengths.
5. **Evaluation**: Jamba performs similarly to leading models like Llama-2 70B and Mixtral on academic benchmarks while offering better throughput.
6. **Long-Context Performance**: Jamba excels in long-context evaluations, outperforming Mixtral on most datasets and achieving better throughput.
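
The layout described in items 1–2 can be made concrete with a small sketch. The PyTorch code below is illustrative only, not the authors' implementation: the Mamba mixer is replaced by a gated-convolution stand-in (the real selective state-space scan is omitted), the layer ratios roughly follow those reported in the paper (one attention layer per eight, MoE with 16 experts and top-2 routing on alternate layers), and the module names, dimensions, and the attention layer's position within the block are arbitrary choices for this sketch.

```python
# Illustrative sketch of a Jamba-style hybrid block: attention or Mamba as the
# token mixer, with an MoE feed-forward on alternate layers. Not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MambaStandIn(nn.Module):
    """Placeholder for a Mamba (selective state-space) mixer: causal depthwise conv + gating."""

    def __init__(self, d_model, d_conv=4, expand=2):
        super().__init__()
        d_inner = expand * d_model
        self.in_proj = nn.Linear(d_model, 2 * d_inner)
        self.conv = nn.Conv1d(d_inner, d_inner, d_conv,
                              padding=d_conv - 1, groups=d_inner)
        self.out_proj = nn.Linear(d_inner, d_model)

    def forward(self, x):                                  # x: (B, T, D)
        h, gate = self.in_proj(x).chunk(2, dim=-1)
        h = self.conv(h.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return self.out_proj(F.silu(h) * torch.sigmoid(gate))


class CausalSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out


class Top2MoE(nn.Module):
    """Top-2 routed MoE MLP (experts evaluated densely here, for clarity only)."""

    def __init__(self, d_model, n_experts=16, d_ff=None):
        super().__init__()
        d_ff = d_ff or 4 * d_model
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):
        scores = self.router(x)                                      # (B, T, E)
        top_w, top_i = scores.topk(2, dim=-1)
        top_w = top_w.softmax(dim=-1)
        stacked = torch.stack([e(x) for e in self.experts], dim=-2)  # (B, T, E, D)
        picked = torch.gather(
            stacked, -2, top_i.unsqueeze(-1).expand(*top_i.shape, x.size(-1)))
        return (picked * top_w.unsqueeze(-1)).sum(dim=-2)


class JambaStyleLayer(nn.Module):
    """Pre-norm residual layer: attention-or-Mamba mixer, then MLP-or-MoE."""

    def __init__(self, d_model, use_attention, use_moe):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.mixer = CausalSelfAttention(d_model) if use_attention else MambaStandIn(d_model)
        self.ff = Top2MoE(d_model) if use_moe else nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))
        return x + self.ff(self.norm2(x))


def jamba_style_block(d_model=512, layers_per_block=8, attn_every=8, moe_every=2):
    # One attention layer per 8 layers; MoE on every other layer (placement assumed).
    return nn.Sequential(*[
        JambaStyleLayer(d_model,
                        use_attention=(i % attn_every == attn_every // 2),
                        use_moe=(i % moe_every == 1))
        for i in range(layers_per_block)
    ])


if __name__ == "__main__":
    block = jamba_style_block()
    x = torch.randn(2, 16, 512)
    print(block(x).shape)  # torch.Size([2, 16, 512])
```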
The model's architecture and design choices are detailed, including the ratio of attention-to-Mamba layers, the frequency of MoE layers, and the number of experts per layer. Ablation studies highlight the benefits of combining attention and Mamba layers and the effectiveness of MoE, and indicate that explicit positional information may not be required. The release of Jamba includes model checkpoints from various ablation runs to encourage further research.
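
Since only the attention layers maintain a KV cache, the attention-to-Mamba ratio translates directly into memory savings at long context. The back-of-the-envelope calculation below uses illustrative layer counts and head dimensions (not Jamba's published configuration) to show the effect of keeping attention in only one of every eight layers:

```python
# KV-cache memory scales with the number of *attention* layers only, so a
# 1-in-8 hybrid needs ~8x less cache than an all-attention model of equal depth.
# All numbers here are illustrative, not Jamba's published configuration.
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, bytes_per_param=2):
    # 2x for keys and values; fp16/bf16 -> 2 bytes per element
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_param

seq_len = 256_000                      # long-context setting
full_attn = kv_cache_bytes(n_attn_layers=32, n_kv_heads=8, head_dim=128, seq_len=seq_len)
hybrid    = kv_cache_bytes(n_attn_layers=4,  n_kv_heads=8, head_dim=128, seq_len=seq_len)
# e.g. ~31 GiB for the all-attention model vs ~4 GiB for the 1-in-8 hybrid
print(f"all-attention: {full_attn / 2**30:.1f} GiB, 1-in-8 hybrid: {hybrid / 2**30:.1f} GiB")
```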