An Empirical Study of Mamba-based Language Models


2024 | Roger Waleffe, Wonmin Byeon, Duncan Riach, Brandon Norick, Vijay Korthikanti, Tri Dao, Albert Gu, Ali Hatamizadeh, Sudhakar Singh, Deepak Narayanan, Garvit Kulshreshtha, Vartika Singh, Jared Casper, Jan Kautz, Mohammad Shoeybi, Bryan Catanzaro
This empirical study compares Mamba-based and Transformer-based large language models (LLMs) at scale. It evaluates 8B-parameter models trained on up to 3.5T tokens, including Mamba, Mamba-2, and a hybrid architecture that combines Mamba-2, self-attention, and MLP layers (Mamba-2-Hybrid). While the pure Mamba and Mamba-2 models match or exceed Transformers on many tasks, they lag behind on tasks requiring strong copying or long-context reasoning, such as MMLU and Phonebook Lookup. The Mamba-2-Hybrid model, in contrast, outperforms the Transformer on all 12 standard tasks, with an average improvement of 2.65 points, and is predicted to be up to 8× faster at inference time.

The hybrid model also excels on long-context tasks, maintaining its performance across 23 additional long-context benchmarks. The study highlights the hybrid model's ability to retrieve, track, and aggregate information over long contexts, although it still struggles with certain multi-document question-answering tasks. The hybrid model is also more robust to prompt-formatting changes than the Transformer. The authors release code and checkpoints for further research, emphasizing the potential of hybrid models to deliver faster, more efficient inference without compromising accuracy.