12 Jun 2024 | Roger Waleffe, Wonmin Byeon, Duncan Riach, Brandon Norick, Vijay Korthikanti, Tri Dao, Albert Gu, Ali Hatamizadeh, Sudhakar Singh, Deepak Narayanan, Garvit Kulshreshtha, Vartika Singh, Jared Casper, Jan Kautz, Mohammad Shoeybi, Bryan Catanzaro
This paper presents a comprehensive empirical study comparing Mamba-based language models with Transformers when both are trained at scale on the same data. The study aims to understand the strengths and weaknesses of Mamba models, selective state-space models (SSMs) designed to overcome the quadratic compute cost and growing inference-time memory use of Transformer self-attention. The research involves training 8B-parameter Mamba, Mamba-2, and Transformer models on datasets of up to 3.5T tokens and evaluating them on a diverse set of natural language tasks.
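To make the contrast concrete, here is a minimal sketch (not the paper's code) of why an SSM avoids the Transformer's inference-time memory growth: a Mamba-style selective recurrence keeps a fixed-size state per token, whereas self-attention must cache keys and values for every previous token. All tensor shapes and the toy update rule below are illustrative assumptions.

```python
import torch

d_model, d_state, seq_len = 16, 4, 8

# Mamba-style selective recurrence: the state h has a fixed size, so per-token
# compute and memory do not grow with sequence length.
h = torch.zeros(d_model, d_state)
for t in range(seq_len):
    x_t = torch.randn(d_model)              # current token's features
    # In Mamba the discretized A, B, C are functions of x_t ("selective");
    # random stand-ins are used here for brevity.
    A_bar = torch.rand(d_model, d_state)    # per-channel decay
    B_bar = torch.randn(d_model, d_state)   # input projection
    C = torch.randn(d_model, d_state)       # output projection
    h = A_bar * h + B_bar * x_t[:, None]    # h_t = A_bar * h_{t-1} + B_bar * x_t (elementwise)
    y_t = (C * h).sum(dim=-1)               # output for this token, shape (d_model,)

# Transformer self-attention at inference: the KV cache grows linearly with
# sequence length, which is the memory cost SSMs are designed to avoid.
k_cache, v_cache = [], []
for t in range(seq_len):
    k_cache.append(torch.randn(d_model))
    v_cache.append(torch.randn(d_model))

print(tuple(h.shape), "fixed-size state vs.", len(k_cache), "cached keys")
```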
Key findings include:
1. **Performance on Standard Tasks**: Mamba and Mamba-2 models match or exceed Transformers on many standard tasks, but lag behind on tasks requiring strong copying or in-context learning, such as five-shot MMLU and Phonebook Lookup.
2. **Hybrid Models**: An 8B-parameter hybrid model combining Mamba-2, self-attention, and MLP layers (Mamba-2-Hybrid) outperforms both the pure Mamba-2 model and the Transformer on all 12 standard tasks, exceeding the Transformer by 2.65 points on average (see the layer-stack sketch after this list).
3. **Inference Speed**: Mamba-2-Hybrid is significantly faster than the Transformer at inference time, with a predicted speedup of up to 8× when generating tokens.
4. **Long-Context Capabilities**: The hybrid model continues to perform well on long-context tasks, showing no degradation compared to shorter context variants. However, it faces challenges on multi-document question answering tasks.
5. **Synthetic Long-Context Tasks**: Mamba-2-Hybrid demonstrates superior performance on synthetic long-context tasks, particularly in recalling, tracking, and aggregating information across long inputs.
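To make the hybrid layout in finding 2 concrete, below is a minimal PyTorch sketch of a Mamba-2-Hybrid-style decoder stack. The block classes are simplified stand-ins rather than the paper's implementations; the layer counts (24 Mamba-2, 4 self-attention, 28 MLP) follow the paper's 8B hybrid, but the exact ordering shown here is an assumption.

```python
import torch.nn as nn

class Mamba2Block(nn.Module):
    """Stand-in for a Mamba-2 (selective SSM) layer; the real block uses a selective scan."""
    def __init__(self, d):
        super().__init__()
        self.mix = nn.Linear(d, d)
    def forward(self, x):
        return x + self.mix(x)

class AttentionBlock(nn.Module):
    """Stand-in for a self-attention layer (non-causal here, for illustration only)."""
    def __init__(self, d, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
    def forward(self, x):
        out, _ = self.attn(x, x, x, need_weights=False)
        return x + out

class MLPBlock(nn.Module):
    """Stand-in for a feed-forward (MLP) layer."""
    def __init__(self, d):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
    def forward(self, x):
        return x + self.ff(x)

def build_hybrid_stack(d_model=1024, pattern="M-M-M-M-M-M-*-" * 4):
    """Build a decoder stack from a layer-pattern string:
    'M' = Mamba-2, '*' = self-attention, '-' = MLP.
    This pattern yields 24/4/28 layers; the ordering is an illustrative guess."""
    block = {"M": Mamba2Block, "*": AttentionBlock, "-": MLPBlock}
    return nn.Sequential(*[block[c](d_model) for c in pattern])

stack = build_hybrid_stack()
n_attn = sum(isinstance(m, AttentionBlock) for m in stack)
print(n_attn, "attention layers out of", len(stack))  # only a small fraction is attention
```

The design point the sketch illustrates is that attention makes up only a small fraction of the stack, which is what keeps the hybrid's inference cost close to a pure SSM while recovering the copying and in-context-learning ability noted in finding 1.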
The study concludes that while pure SSM models have limitations on certain tasks, hybrid models combining SSM and Transformer layers offer improved performance and efficiency, making them a promising alternative to Transformers. The model checkpoints and training code are released as part of NVIDIA’s Megatron-LM project to facilitate further research and adoption.