VL-Mamba: Exploring State Space Models for Multimodal Learning

20 Mar 2024 | Yanyuan Qiao, Zheng Yu, Longteng Guo, Sihan Chen, Zijia Zhao, Mingzhen Sun, Qi Wu, Jing Liu
VL-Mamba is a multimodal large language model built on state space models (SSMs), designed to address the computational inefficiency of Transformer-based models on long-sequence tasks. The paper replaces the Transformer-based language model with a pre-trained Mamba language model and integrates a 2D vision selective scan mechanism for visual processing. It also introduces a novel MultiModal Connector (MMC) with a Vision Selective Scan (VSS) module to improve the representation of 2D visual sequences. The VSS module includes two scan mechanisms, the Bidirectional-Scan Mechanism (BSM) and the Cross-Scan Mechanism (CSM), which enable efficient processing of visual data.

The model is evaluated on a range of multimodal benchmarks and is competitive with other multimodal large language models, achieving strong results on tasks such as visual question answering, image captioning, and visual reasoning while reducing computational complexity relative to Transformer-based models. Ablation studies assess the contribution of individual components, including the language model, the vision encoder, and the MMC architecture; the findings indicate that the VSS module significantly improves performance, especially on tasks requiring 2D visual processing. The model is open-sourced to promote further research into state space models for multimodal learning.
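To make the connector design concrete, below is a minimal sketch of how an MMC with a VSS module using the Bidirectional-Scan Mechanism might look. It is not the authors' implementation: the class names, the `Mamba1D` placeholder (which stands in for a real 1-D selective SSM block, e.g. one from the `mamba_ssm` package), and the fusion by summation are all assumptions made for illustration.

```python
# Hedged sketch of an MMC + VSS (Bidirectional-Scan Mechanism), not the paper's code.
import torch
import torch.nn as nn


class Mamba1D(nn.Module):
    """Hypothetical stand-in for a 1-D selective state space (Mamba) block.

    The gated MLP below is NOT the real Mamba recurrence; it only keeps the
    sketch self-contained and runnable.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.proj_in = nn.Linear(dim, dim)
        self.proj_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (B, L, D)
        return self.proj_out(torch.nn.functional.silu(self.proj_in(x)))


class VisionSelectiveScanBSM(nn.Module):
    """Bidirectional-Scan Mechanism: scan the flattened visual token sequence
    forward and backward with 1-D SSM blocks, then fuse the two passes."""

    def __init__(self, dim: int):
        super().__init__()
        self.fwd_ssm = Mamba1D(dim)
        self.bwd_ssm = Mamba1D(dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # (B, L, D)
        fwd = self.fwd_ssm(tokens)                  # left-to-right scan
        bwd = self.bwd_ssm(tokens.flip(dims=[1]))   # right-to-left scan
        bwd = bwd.flip(dims=[1])                    # restore original token order
        return self.norm(fwd + bwd)                 # fuse both directions (assumed)


class MultiModalConnector(nn.Module):
    """MMC: VSS over vision-encoder patch features, then a linear projection
    into the language model's embedding space."""

    def __init__(self, vis_dim: int, llm_dim: int):
        super().__init__()
        self.vss = VisionSelectiveScanBSM(vis_dim)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:  # (B, L, vis_dim)
        # Output (B, L, llm_dim) is concatenated with text embeddings
        # and fed to the Mamba language model.
        return self.proj(self.vss(patch_feats))


if __name__ == "__main__":
    # Example: 576 patch tokens (a 24x24 grid) from a ViT-style vision encoder.
    feats = torch.randn(2, 576, 1024)
    mmc = MultiModalConnector(vis_dim=1024, llm_dim=2048)
    print(mmc(feats).shape)  # torch.Size([2, 576, 2048])
```

The Cross-Scan Mechanism would differ only in the traversal pattern: instead of two passes over the flattened sequence, it scans the 2D token grid along multiple directions (e.g. row-wise and column-wise, forward and backward) before fusing, which is what gives the module its awareness of 2D spatial structure.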