23 Jul 2024 | Ali Behrouz, Michele Santacatterina, Ramin Zabih
**MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection**
**Authors:** Ali Behrouz, Michele Santacatterina, Ramin Zabih
**Project Page:** [Code & Models]
**Abstract:**
Recent advances in deep learning have relied heavily on Transformers, whose attention mechanism is data-dependent and scales well with data and compute. However, attention has quadratic time and space complexity in the sequence length, limiting its scalability for long-sequence modeling. To address this, State Space Models (SSMs), particularly Selective State Space Models (S6), have shown promising potential for long-sequence modeling. Motivated by this success, the authors present MambaMixer, a new architecture that applies a dual selection mechanism across both tokens and channels, implemented as a Selective Token Mixer and a Selective Channel Mixer. MambaMixer further connects these sequential selective mixers with a weighted averaging mechanism that gives layers direct access to the inputs and outputs of earlier layers. As a proof of concept, the authors design the Vision MambaMixer (ViM2) and Time Series MambaMixer (TSM2) architectures on top of MambaMixer and evaluate them on a range of vision and time series forecasting tasks. The results demonstrate the importance of selectively mixing across both tokens and channels. On ImageNet classification, object detection, and semantic segmentation, ViM2 is competitive with established vision models and outperforms prior SSM-based vision models. In time series forecasting, TSM2, an attention- and MLP-free architecture, achieves outstanding performance compared to state-of-the-art methods while being significantly more computationally efficient.
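As background (the standard S6 formulation from Mamba, not restated in this summary), a selective SSM processes a sequence $x_1, \dots, x_L$ with the discretized recurrence

$$h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t, \qquad y_t = C_t h_t,$$

where $\bar{A}_t = \exp(\Delta_t A)$ and $\bar{B}_t \approx \Delta_t B_t$, and the step size $\Delta_t$ and matrices $B_t$, $C_t$ are themselves functions of the input $x_t$. This input dependence is what lets the model selectively propagate or forget information at each step, and it is the mechanism MambaMixer applies along both the token and channel axes.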
**Contributions:**
1. Presenting MambaMixer, a new SSM-based architecture with dual selection that efficiently mixes informative tokens and channels while filtering out irrelevant ones.
2. Demonstrating the effectiveness of bidirectional S6 blocks at focusing on, or ignoring, particular channels.
3. Enhancing information flow in multi-layer MambaMixer-based architectures using a weighted averaging mechanism.
4. Presenting ViM2 and TSM2 models for vision and time series forecasting tasks, achieving superior performance compared to baselines.
**Related Work:**
The paper discusses related studies in sequence modeling, architectures for generic vision backbones, and architectures for generic time series backbones. It highlights the limitations of existing methods and the advantages of MambaMixer in handling non-causal and multi-dimensional data.
**Model: MambaMixer**
MambaMixer is designed to efficiently select and mix informative tokens and channels while filtering out irrelevant ones. Each block consists of a Selective Token Mixer and a Selective Channel Mixer, each built from a bidirectional S6 block. A weighted averaging mechanism gives every layer direct access to the features of earlier layers, improving information flow and training stability.
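The following is a minimal, hypothetical sketch (not the authors' code) of how such a block could look in PyTorch. The `selective_scan` loop is a naive stand-in for an optimized S6 kernel, the bidirectional scan is implemented as a forward pass plus a reversed pass, and the softmax-normalized weighted averaging over earlier outputs is an assumption about the exact scheme; all names and shapes are illustrative.

```python
# Minimal, hypothetical sketch of MambaMixer-style blocks (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


def selective_scan(x, dt, B, C, A):
    """Naive selective SSM recurrence: h_t = exp(dt_t*A) h_{t-1} + (dt_t*B_t) x_t, y_t = C_t h_t."""
    batch, length, d_inner = x.shape
    h = x.new_zeros(batch, d_inner, A.shape[1])                 # one hidden state per channel
    ys = []
    for t in range(length):
        dA = torch.exp(dt[:, t].unsqueeze(-1) * A)              # (batch, d_inner, d_state)
        dB = dt[:, t].unsqueeze(-1) * B[:, t].unsqueeze(1)       # (batch, d_inner, d_state)
        h = dA * h + dB * x[:, t].unsqueeze(-1)
        ys.append((h * C[:, t].unsqueeze(1)).sum(-1))            # (batch, d_inner)
    return torch.stack(ys, dim=1)                                # (batch, length, d_inner)


class BidirectionalS6(nn.Module):
    """Toy bidirectional selective SSM: a forward scan plus a reversed scan, summed."""
    def __init__(self, dim, d_state=16):
        super().__init__()
        self.A = nn.Parameter(-torch.rand(dim, d_state))         # negative entries keep the scan stable
        self.proj = nn.Linear(dim, dim + 2 * d_state)            # predicts dt, B, C from the input
        self.dim, self.d_state = dim, d_state

    def _scan(self, x):
        dt, B, C = self.proj(x).split([self.dim, self.d_state, self.d_state], dim=-1)
        return selective_scan(x, F.softplus(dt), B, C, self.A)

    def forward(self, x):
        return self._scan(x) + self._scan(x.flip(1)).flip(1)


class MambaMixerBlock(nn.Module):
    """Selective Token Mixer (scan over tokens) followed by a Selective Channel Mixer (scan over channels)."""
    def __init__(self, dim, num_tokens, d_state=16):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.token_mixer = BidirectionalS6(dim, d_state)            # sequence axis = tokens
        self.channel_mixer = BidirectionalS6(num_tokens, d_state)   # sequence axis = channels

    def forward(self, x):                                           # x: (batch, num_tokens, dim)
        x = x + self.token_mixer(self.norm1(x))
        y = self.channel_mixer(self.norm2(x).transpose(1, 2))       # (batch, dim, num_tokens)
        return x + y.transpose(1, 2)


class MambaMixer(nn.Module):
    """Stack of blocks; each layer consumes a learned weighted average of all earlier features."""
    def __init__(self, depth, dim, num_tokens):
        super().__init__()
        self.blocks = nn.ModuleList(MambaMixerBlock(dim, num_tokens) for _ in range(depth))
        self.mix_weights = nn.Parameter(torch.zeros(depth, depth + 1))  # exact weighting scheme is assumed

    def forward(self, x):
        feats = [x]
        for i, block in enumerate(self.blocks):
            w = torch.softmax(self.mix_weights[i, : len(feats)], dim=0)
            feats.append(block(sum(wi * f for wi, f in zip(w, feats))))
        return feats[-1]


# Usage: 196 tokens (e.g., 14x14 patches) with 192 channels.
model = MambaMixer(depth=2, dim=192, num_tokens=196)
out = model(torch.randn(2, 196, 192))                               # -> (2, 196, 192)
```

In a real implementation the sequential loop would be replaced by a parallel, hardware-aware scan kernel, which is what makes selective SSMs efficient in practice.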
**Computational Complexity:**
MambaMixer has linear time and space complexity with respect to the sequence length and number of channels, addressing the quadratic complexity of Transformers.
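To make the gap concrete with illustrative numbers (not figures from the paper): for a sequence of $L = 16{,}384$ tokens, self-attention materializes on the order of $L^2 \approx 2.7 \times 10^8$ pairwise scores per head, whereas a selective scan touches each of the $L$ tokens a constant number of times per state dimension, so doubling $L$ roughly doubles, rather than quadruples, the cost. The same linear argument applies along the channel axis for the Selective Channel Mixer.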
**Vision MambaMixer (ViM2)**
ViM2 adapts MambaMixer to non-causal, two-dimensional image data. It is evaluated on ImageNet classification, object detection, and semantic segmentation, where it performs competitively with established vision backbones and outperforms prior SSM-based vision models.
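As an illustration of the interface (hypothetical glue code, not from the paper; patch size and embedding dimension are assumptions), an image can be turned into the `(batch, num_tokens, dim)` sequence that the block sketch above consumes via a standard non-overlapping patch embedding:

```python
import torch
import torch.nn as nn

patch, dim = 16, 192
embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)    # non-overlapping patch embedding

img = torch.randn(2, 3, 224, 224)                              # (batch, RGB, H, W)
tokens = embed(img).flatten(2).transpose(1, 2)                 # (2, 196, 192): 14 x 14 patches
# These tokens can be fed to a stack of MambaMixer blocks; ViM2 additionally has to
# account for the non-causal 2D structure of images (e.g., by scanning in multiple directions).
```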