Mamba-ND: Selective State Space Modeling for Multi-Dimensional Data
**Abstract:**
Transformers have become the standard architecture for sequence modeling, but their self-attention layers scale quadratically with sequence length. Mamba, a state space model (SSM), achieves performance comparable to Transformers while scaling linearly. This work presents Mamba-ND, which extends Mamba to arbitrary multi-dimensional data. Mamba-ND alternately processes the input data along its different dimensions, following row-major scan orderings. Extensive experiments show that Mamba-ND outperforms Transformers on various multi-dimensional benchmarks, including ImageNet-1K, HMDB-51, UCF-101, ERA5, and BTCV, with significantly fewer parameters and subquadratic complexity.
**Keywords:**
State Space Models · Multi-Dimensional Modeling
- **Introduction:**
- **Background on SSMs:** SSMs model input sequences through a linear ordinary differential equation (ODE) over a hidden state and have shown strong performance on long sequences (the standard formulation and its discretization are recalled below).
- **Mamba Layer:** Mamba layers consist of a 1D convolution, an SSM kernel, and a residual connection.
- **Methodology:** Various approaches to adapting Mamba to multi-dimensional data are explored, including layer-level and block-level designs.
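For context, the SSM formulation that Mamba builds on is the standard one from the S4 line of work rather than anything specific to Mamba-ND: a continuous-time linear ODE over a hidden state $h(t)$, discretized with a zero-order hold of step size $\Delta$:

$$
h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t),
$$

$$
\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B, \qquad h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t.
$$

In Mamba, $\Delta$, $B$, and $C$ are additionally made input-dependent ("selective"), which lets the recurrence filter content along the scan.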
- **Scan Orderings:**
- **Definition:** Scan orderings are permutations of the axes of the input data flattened into a 1D sequence.
- **Examples:** For 2D data there are four possible orderings: $(HW)+$, $(HW)-$, $(WH)+$, and $(WH)-$, where the letters give the axis order of the row-major flattening and the sign gives the scan direction.
- **3D Data:** For 3D data there are 12 possible orderings ($3!$ axis permutations, each in two directions), such as $(HWT)+$ and $(WHT)-$. A minimal flattening sketch appears below.
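The sketch below shows how the four 2D orderings can be realized by permuting the spatial axes and optionally reversing the row-major flattening; `scan_order_2d` is a hypothetical helper for illustration, not the paper's implementation:

```python
import torch

def scan_order_2d(x, order="HW", direction=+1):
    """Flatten a (H, W, C) tensor into a 1D sequence under a scan ordering.

    order:     "HW" flattens row-major with W varying fastest; "WH" swaps the
               spatial axes first, so H varies fastest.
    direction: +1 keeps the forward scan, -1 reverses the flattened sequence.
    """
    h, w, c = x.shape
    if order == "WH":
        x = x.transpose(0, 1)            # (W, H, C): scan columns first
    seq = x.reshape(-1, c)               # row-major flattening to (H*W, C)
    if direction == -1:
        seq = seq.flip(0)                # reverse scan direction
    return seq

x = torch.randn(4, 6, 16)                # toy (H, W, C) feature map
orderings = [("HW", +1), ("HW", -1), ("WH", +1), ("WH", -1)]
seqs = [scan_order_2d(x, o, d) for o, d in orderings]
print([tuple(s.shape) for s in seqs])    # four sequences of shape (24, 16)
```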
- **Adapting the Mamba Layer:**
- **Bi-SSM Layer:** Passes the output of the convolution layer to two independent SSM kernels, one over the forward sequence and one over the reversed sequence (see the sketch below).
- **ND-SSM Layer:** Extends Bi-SSM by incorporating additional SSMs for different orderings.
- **Multi-head SSM Layer:** Splits the input sequence into multiple heads, each processed by a separate SSM kernel.
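As a rough illustration of the Bi-SSM idea, the sketch below feeds the 1D-conv output to two independent sequence kernels, one over the forward sequence and one over the reversed sequence, then fuses them with a residual connection. The `nn.GRU` modules are hypothetical stand-ins for Mamba's selective SSM kernel, used only so the example runs end to end:

```python
import torch
import torch.nn as nn

class BiSSMLayer(nn.Module):
    """Sketch of a Bi-SSM-style layer: conv output -> two directional kernels."""

    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.ssm_fwd = nn.GRU(dim, dim, batch_first=True)  # stand-in for forward SSM
        self.ssm_bwd = nn.GRU(dim, dim, batch_first=True)  # stand-in for backward SSM
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, x):                                   # x: (B, L, D)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)    # depthwise 1D conv over L
        y_fwd, _ = self.ssm_fwd(h)                          # scan in forward order
        y_bwd, _ = self.ssm_bwd(h.flip(1))                  # scan in reversed order
        y = torch.cat([y_fwd, y_bwd.flip(1)], dim=-1)       # realign and concatenate
        return x + self.proj(y)                             # residual connection

layer = BiSSMLayer(dim=32)
print(layer(torch.randn(2, 100, 32)).shape)                 # torch.Size([2, 100, 32])
```

The ND-SSM variant follows the same pattern with one kernel per scan ordering, while the multi-head variant instead splits the channels across separate kernels.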
- **Arranging Mamba Layers:**
- **Alternating-Directional:** Alternates the scan direction of the SSM from one layer to the next (see the sketch below).
- **Bi-Directional:** Processes the input in opposite directions in each layer.
- **Quad-Directional:** Further groups the scan directions, four per block, to improve performance.
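A sketch of the alternating-directional arrangement for 2D inputs, cycling through the four scan orderings layer by layer. `make_layer` and the `nn.Identity` placeholder are assumptions standing in for a real Mamba layer:

```python
import torch
import torch.nn as nn

class AlternatingMambaND(nn.Module):
    """Sketch: each layer runs one 1D scan; consecutive layers cycle orderings."""

    def __init__(self, dim, depth, make_layer):
        super().__init__()
        self.layers = nn.ModuleList([make_layer(dim) for _ in range(depth)])
        self.orderings = [("HW", +1), ("HW", -1), ("WH", +1), ("WH", -1)]

    def forward(self, x):                                    # x: (B, H, W, C)
        b, h, w, c = x.shape
        for i, layer in enumerate(self.layers):
            order, direction = self.orderings[i % len(self.orderings)]
            t = x.transpose(1, 2) if order == "WH" else x    # pick axis order
            d1, d2 = t.shape[1], t.shape[2]
            seq = t.reshape(b, -1, c)                        # flatten to a 1D sequence
            if direction == -1:
                seq = seq.flip(1)                            # reverse the scan
            seq = layer(seq)                                 # 1D sequence layer
            if direction == -1:
                seq = seq.flip(1)                            # undo the reversal
            t = seq.reshape(b, d1, d2, c)
            x = t.transpose(1, 2) if order == "WH" else t    # restore (B, H, W, C)
        return x

model = AlternatingMambaND(dim=16, depth=4, make_layer=lambda d: nn.Identity())
print(model(torch.randn(2, 8, 8, 16)).shape)                 # torch.Size([2, 8, 8, 16])
```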
- **Scan Factorization:**
- **Purpose:** To keep individual scans short and efficient, various ways of factorizing the single scan over the fully flattened sequence into smaller scans along individual axes are explored (a sketch appears below).
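The sketch below shows what such a factorization can look like for 2D data: one batched scan of length $W$ over every row, followed by one batched scan of length $H$ over every column. `scan_1d` is a hypothetical callable standing in for a 1D SSM scan over a `(batch, length, channels)` tensor; the identity is used here so the example runs:

```python
import torch

def factorized_scan_2d(x, scan_1d):
    """Replace one scan over H*W tokens with two shorter axis-wise scans."""
    b, h, w, c = x.shape
    # Scan of length W over each of the H rows (rows batched together).
    x = scan_1d(x.reshape(b * h, w, c)).reshape(b, h, w, c)
    # Scan of length H over each of the W columns (columns batched together).
    x = x.transpose(1, 2).reshape(b * w, h, c)
    x = scan_1d(x).reshape(b, w, h, c).transpose(1, 2)
    return x

out = factorized_scan_2d(torch.randn(2, 8, 8, 16), scan_1d=lambda s: s)
print(out.shape)                             # torch.Size([2, 8, 8, 16])
```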
- **Experiments:**
- **Datasets and Setups:** ImageNet-1K, HMDB-51, UCF-101, ERA5, and BTCV are used for evaluation.