Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling

28 Feb 2025 | Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, Weizhu Chen
SAMBA is a novel hybrid neural architecture for efficient language modeling with unlimited context length. It combines the strengths of State Space Models (SSMs) and attention mechanisms, achieving linear-time complexity while delivering strong performance across a range of benchmarks. Mamba layers, Sliding Window Attention (SWA), and Multi-Layer Perceptrons (MLPs) are interleaved layer by layer so that the model captures recurrent structure while retaining precise memory retrieval. Trained at up to 3.8 billion parameters on 3.2 trillion tokens, SAMBA outperforms state-of-the-art models on tasks such as commonsense reasoning, language understanding, and long-context summarization. It extrapolates efficiently to longer contexts, achieving perfect memory recall on challenging tasks like Passkey Retrieval and Phonebook, and attains higher throughput than Transformers with grouped-query attention. The study also explores the benefits of hybridizing attention with linear recurrence, providing insights into optimal training configurations and the effectiveness of Mamba's input selection mechanism.
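
To make the layer-wise interleaving concrete, the sketch below shows one hybrid block in PyTorch. It is a minimal illustration, not the authors' implementation: the Mamba -> MLP -> SWA -> MLP ordering, the pre-norm residual structure, the window size, and the class names (`MambaStandIn`, `SlidingWindowAttention`, `MLP`, `SambaBlock`) are assumptions for exposition, and `MambaStandIn` uses a GRU only as a self-contained placeholder for a real selective SSM layer.

```python
# Minimal sketch of a Samba-style hybrid block (not the authors' code).
# Assumed pattern: Mamba -> MLP -> SWA -> MLP, each sub-layer with
# pre-layer-norm and a residual connection.
import torch
import torch.nn as nn


class MambaStandIn(nn.Module):
    """Placeholder for a Mamba (selective SSM) layer; a GRU keeps the
    sketch self-contained and runnable, but is not the real Mamba."""
    def __init__(self, d_model: int):
        super().__init__()
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, x):
        out, _ = self.rnn(x)
        return out


class SlidingWindowAttention(nn.Module):
    """Multi-head self-attention restricted to a causal sliding window."""
    def __init__(self, d_model: int, n_heads: int, window: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.window = window

    def forward(self, x):
        seq_len = x.size(1)
        idx = torch.arange(seq_len, device=x.device)
        # Token i may attend to tokens j with i - window < j <= i.
        mask = (idx[None, :] > idx[:, None]) | (idx[:, None] - idx[None, :] >= self.window)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out


class MLP(nn.Module):
    """Standard feed-forward sub-layer with GELU activation."""
    def __init__(self, d_model: int, expansion: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, expansion * d_model),
            nn.GELU(),
            nn.Linear(expansion * d_model, d_model),
        )

    def forward(self, x):
        return self.net(x)


class SambaBlock(nn.Module):
    """One hybrid block interleaving recurrence (Mamba) and local attention (SWA)."""
    def __init__(self, d_model: int, n_heads: int, window: int):
        super().__init__()
        self.layers = nn.ModuleList([
            MambaStandIn(d_model),
            MLP(d_model),
            SlidingWindowAttention(d_model, n_heads, window),
            MLP(d_model),
        ])
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in self.layers])

    def forward(self, x):
        for norm, layer in zip(self.norms, self.layers):
            x = x + layer(norm(x))  # pre-norm residual sub-layer
        return x


if __name__ == "__main__":
    block = SambaBlock(d_model=128, n_heads=4, window=16)
    tokens = torch.randn(2, 64, 128)  # (batch, sequence, hidden)
    print(block(tokens).shape)        # torch.Size([2, 64, 128])
```

In this arrangement, the SSM-style layer provides linear-time recurrent summarization of the full history, while the sliding-window attention handles precise retrieval within a bounded local context, which is the division of labor the paper attributes to its hybrid design.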