SiMBA: Simplified Mamba-based Architecture for Vision and Multivariate Time series


24 Apr 2024 | Badri N. Patro and Vijay S. Agneeswaran
**Authors:** Badri N. Patro and Vijay S. Agneeswaran, Microsoft

**Abstract:** Transformers have widely adopted attention networks for sequence mixing and MLPs for channel mixing, playing a pivotal role in achieving breakthroughs across domains. However, recent literature highlights issues with attention networks, including low inductive bias and quadratic complexity in the input sequence length. State Space Models (SSMs) such as S4 have emerged to address these issues, but they face challenges in handling long sequences. Mamba, while being the state-of-the-art SSM, has stability issues when scaled to large networks on computer vision datasets. We propose SiMBA, a new architecture that introduces Einstein FFT (EinFFT) for channel modeling via specific eigenvalue computations and uses the Mamba block for sequence modeling. Extensive performance studies across image and time-series benchmarks demonstrate that SiMBA outperforms existing SSMs and bridges the performance gap with state-of-the-art transformers. Notably, SiMBA establishes itself as the new state-of-the-art SSM on ImageNet, on transfer learning benchmarks such as Stanford Cars and Flowers, on task learning benchmarks, and on seven time-series benchmark datasets.

**Keywords:** Transformer · Mamba · Spectral Channel Mixing · State Space Model

**Introduction:** The evolution of language models has transitioned from Large Language Models (LLMs) to Small Language Models (SLMs). At the heart of both lies the power of transformers, which scale in both token and channel modeling. Traditional multi-headed self-attention (MHSA) in transformers poses computational challenges, particularly for longer sequences. To address this, SSMs such as S4 leverage state-space-based sequential modeling, offering enhanced efficiency and performance for processing long input sequences. However, S4 and other SSMs face challenges in handling information-dense data, especially in domains like computer vision and genomics. Mamba addresses these limitations by incorporating the current token into the state space, enabling in-context learning. However, Mamba has stability issues when scaled to large networks, leading to vanishing/exploding gradients.

**SiMBA:** SiMBA introduces EinFFT, a novel channel-modeling technique that uses Fourier transforms with non-linear activation functions to ensure the eigenvalues are negative real numbers, resolving Mamba's stability issue. SiMBA also incorporates residual connections and dropout to further enhance stability. The architectural investigation explores diverse state space and attention models for sequence mixing, as well as various channel-modeling alternatives. Through extensive experimentation, SiMBA is identified as the most efficient and streamlined state space architecture, closing the performance gap with state-of-the-art attention-based transformers on ImageNet and six standard time series datasets.
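The stability claim rests on a standard fact about linear state-space systems, worth stating explicitly (this is textbook control theory, not a derivation from the paper):

```latex
% Continuous-time state-space recurrence underlying S4/Mamba-style SSMs.
% Standard criterion (not specific to SiMBA): the hidden state stays bounded
% iff every eigenvalue of the state matrix A has negative real part.
\[
  \dot{h}(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)
\]
\[
  \text{asymptotically stable} \iff \operatorname{Re}\big(\lambda_i(A)\big) < 0 \quad \text{for all } i .
\]
```

Constraining the eigenvalues to be negative real numbers, as the EinFFT formulation aims to do, is therefore a sufficient condition for keeping hidden states from exploding as depth and width grow.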
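To make the block layout concrete, here is a minimal PyTorch sketch of a SiMBA block as described above: Mamba for sequence mixing, a channel mixer in place of the transformer MLP, plus the layer norms, residual connections, and dropout mentioned for stability. The `Mamba` import assumes the third-party `mamba_ssm` package; `ChannelMixer` is a plain MLP stand-in for EinFFT (sketched under Method below), and all class and argument names are illustrative rather than taken from the authors' code.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # assumed dependency: pip install mamba-ssm


class ChannelMixer(nn.Module):
    """Placeholder for EinFFT: a standard MLP channel mixer."""
    def __init__(self, dim: int, expansion: int = 4, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, expansion * dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(expansion * dim, dim),
        )

    def forward(self, x):
        return self.net(x)


class SiMBABlock(nn.Module):
    """One block: Mamba sequence mixing + channel mixing, each with a
    pre-norm residual branch and dropout."""
    def __init__(self, dim: int, dropout: float = 0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.seq_mixer = Mamba(d_model=dim)                   # sequence (token) mixing
        self.norm2 = nn.LayerNorm(dim)
        self.chan_mixer = ChannelMixer(dim, dropout=dropout)  # channel mixing
        self.drop = nn.Dropout(dropout)

    def forward(self, x):  # x: (batch, tokens, dim)
        x = x + self.drop(self.seq_mixer(self.norm1(x)))      # residual around Mamba
        x = x + self.drop(self.chan_mixer(self.norm2(x)))     # residual around channel mixer
        return x


# Usage: tokens from a patch-embedded image or a time-series window, e.g.
# x = torch.randn(2, 196, 192); y = SiMBABlock(192)(x)
```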
**Method:** SiMBA's channel-mixing component, EinFFT, consists of three stages: a spectral transformation of the input, a spectral gating network using Einstein matrix multiplication (EMM) in the frequency domain, and an inverse spectral transformation back to the original domain. Sequence mixing is performed by the Mamba block.
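Below is a minimal, hedged PyTorch sketch of the EinFFT idea as described above, not the authors' implementation. Assumptions: the FFT is taken along the token dimension, channels are split into blocks, the spectral gate is a learned block-wise complex matrix multiplication (the Einstein matrix multiplication) with a bias, a simple ReLU on the real and imaginary parts stands in for the frequency-domain nonlinearity, and a single gating stage is shown for brevity.

```python
import torch
import torch.nn as nn


class EinFFTChannelMixer(nn.Module):
    """Sketch of an EinFFT-style channel mixer: FFT -> block-wise complex
    (Einstein) matrix multiplication -> nonlinearity -> inverse FFT."""
    def __init__(self, dim: int, num_blocks: int = 4):
        super().__init__()
        assert dim % num_blocks == 0, "dim must be divisible by num_blocks"
        self.num_blocks = num_blocks
        self.block_dim = dim // num_blocks
        scale = 0.02
        # Complex weights stored as (real, imag) pairs: one (block_dim x block_dim)
        # matrix and one bias per channel block, shared across frequency bins.
        self.w = nn.Parameter(scale * torch.randn(2, num_blocks, self.block_dim, self.block_dim))
        self.b = nn.Parameter(scale * torch.randn(2, num_blocks, self.block_dim))

    @staticmethod
    def _emm(x_real, x_imag, w_real, w_imag, b_real, b_imag):
        # Einstein (block-wise) complex matrix multiplication:
        # (xr + i*xi) @ (wr + i*wi) + (br + i*bi), applied per channel block.
        out_real = (torch.einsum("bnkd,kdo->bnko", x_real, w_real)
                    - torch.einsum("bnkd,kdo->bnko", x_imag, w_imag) + b_real)
        out_imag = (torch.einsum("bnkd,kdo->bnko", x_real, w_imag)
                    + torch.einsum("bnkd,kdo->bnko", x_imag, w_real) + b_imag)
        return out_real, out_imag

    def forward(self, x):  # x: (batch, tokens, dim)
        B, N, D = x.shape
        x_f = torch.fft.rfft(x, dim=1, norm="ortho")  # (B, N//2+1, D), complex
        x_f = x_f.reshape(B, x_f.shape[1], self.num_blocks, self.block_dim)
        real, imag = self._emm(x_f.real, x_f.imag,
                               self.w[0], self.w[1], self.b[0], self.b[1])
        # Assumed frequency-domain nonlinearity: ReLU on real and imaginary parts.
        real, imag = torch.relu(real), torch.relu(imag)
        x_f = torch.complex(real, imag).reshape(B, -1, D)
        return torch.fft.irfft(x_f, n=N, dim=1, norm="ortho")  # back to (B, N, D)
```

A design note on the gating stage: applying one small complex matrix per channel block instead of a full D×D mixing matrix keeps the frequency-domain channel mixing sub-quadratic in the channel dimension, which is consistent with the efficiency emphasis in the description above.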