The paper introduces Mamba, a new class of selective state space models (SSMs) designed to improve the efficiency and effectiveness of sequence modeling. Mamba addresses the limitations of existing SSMs, particularly their inability to perform content-based reasoning, by allowing the SSM parameters to be functions of the input. This enables the model to selectively propagate or forget information along the sequence length dimension, depending on the current token. The authors also propose a hardware-aware parallel algorithm for efficient computation in recurrent mode, overcoming the efficiency constraints of previous SSMs. Mamba is integrated into a simplified end-to-end neural network architecture, achieving fast inference (5× higher throughput than Transformers) and linear scaling in sequence length. Empirical evaluations on synthetic tasks, language modeling, DNA sequence modeling, and audio waveform modeling demonstrate Mamba's superior performance, especially on long sequences up to 1 million tokens. Mamba outperforms or matches the performance of large Transformers in pretraining and downstream tasks, making it a promising backbone for foundation models across various modalities.
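To make the selection mechanism concrete, below is a minimal sketch of a selective SSM recurrence in plain NumPy. It is not the paper's hardware-aware kernel or reference implementation; the projection names (`W_B`, `W_C`, `W_dt`), the per-channel diagonal state, and the simplified Euler-style discretization of B are illustrative assumptions. The key idea it demonstrates is that B, C, and the step size Delta are computed from the current token, so the state transition differs at every position.

```python
# Minimal sketch of a selective SSM recurrence (illustrative, not the paper's
# optimized implementation). B_t, C_t, and Delta_t depend on the input token,
# which is what lets the model selectively keep or forget state.
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_scan(x, A, W_B, W_C, W_dt):
    """Run a selective SSM recurrence over an input sequence.

    x    : (L, D) input sequence (L tokens, D channels)
    A    : (D, N) fixed state-transition parameters (kept negative for stability)
    W_B  : (D, N) projection making B a function of the input (assumed shape)
    W_C  : (D, N) projection making C a function of the input (assumed shape)
    W_dt : (D,)   projection making the step size Delta input-dependent
    Returns y : (L, D)
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                      # one N-dim state per channel
    y = np.zeros((L, D))
    for t in range(L):
        B_t = x[t] @ W_B                      # (N,)  input-dependent B
        C_t = x[t] @ W_C                      # (N,)  input-dependent C
        dt = softplus(x[t] * W_dt)            # (D,)  input-dependent step size
        A_bar = np.exp(dt[:, None] * A)       # (D, N) discretized A
        B_bar = dt[:, None] * B_t[None, :]    # (D, N) simplified discretized B
        h = A_bar * h + B_bar * x[t][:, None] # selective recurrence
        y[t] = h @ C_t                        # readout
    return y

# Toy usage with random parameters
rng = np.random.default_rng(0)
L, D, N = 16, 4, 8
x = rng.standard_normal((L, D))
A = -np.abs(rng.standard_normal((D, N)))      # negative A keeps the state stable
y = selective_scan(
    x, A,
    rng.standard_normal((D, N)) * 0.1,
    rng.standard_normal((D, N)) * 0.1,
    rng.standard_normal(D) * 0.1,
)
print(y.shape)  # (16, 4)
```

Because the discretized transition matrices vary per token, the fixed-kernel convolutional mode of earlier SSMs no longer applies, which is why the paper pairs the selection mechanism with a hardware-aware parallel scan for efficient computation in recurrent mode.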