VL-Mamba: Exploring State Space Models for Multimodal Learning

20 Mar 2024 | Yanyuan Qiao, Zheng Yu, Longteng Guo, Sihan Chen, Zijia Zhao, Mingzhen Sun, Qi Wu, Jing Liu
VL-Mamba is a multimodal large language model built on state space models (SSMs), designed to address the computational inefficiency of Transformer-based models on long-sequence tasks. The paper replaces the Transformer-based language model with a pre-trained Mamba language model and integrates a 2D vision selective scan mechanism for visual processing. It also introduces a novel MultiModal Connector (MMC) with a Vision Selective Scan (VSS) module to improve the representation of 2D visual sequences. The VSS module includes two scan mechanisms, the Bidirectional-Scan Mechanism (BSM) and the Cross-Scan Mechanism (CSM), which enable efficient processing of visual data.

The model is evaluated on a range of multimodal benchmarks and is competitive with other multimodal large language models, achieving strong results on tasks such as visual question answering, image captioning, and visual reasoning while reducing computational complexity relative to Transformer-based models. Ablation studies assess the contribution of individual components, including the language model, the vision encoder, and the MMC architecture; the findings indicate that the VSS module significantly improves performance, especially on tasks requiring 2D visual processing. The model is open-sourced to promote further research into state space models for multimodal learning.
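To make the connector design concrete, below is a minimal sketch of how an MMC with a VSS module using the Bidirectional-Scan Mechanism might look. It is not the authors' implementation: the class names, the `Mamba1D` placeholder (which stands in for a real 1-D selective SSM block, e.g. one from the `mamba_ssm` package), and the fusion by summation are all assumptions made for illustration.

```python
# Hedged sketch of an MMC + VSS (Bidirectional-Scan Mechanism), not the paper's code.
import torch
import torch.nn as nn


class Mamba1D(nn.Module):
    """Hypothetical stand-in for a 1-D selective state space (Mamba) block.

    The gated MLP below is NOT the real Mamba recurrence; it only keeps the
    sketch self-contained and runnable.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.proj_in = nn.Linear(dim, dim)
        self.proj_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (B, L, D)
        return self.proj_out(torch.nn.functional.silu(self.proj_in(x)))


class VisionSelectiveScanBSM(nn.Module):
    """Bidirectional-Scan Mechanism: scan the flattened visual token sequence
    forward and backward with 1-D SSM blocks, then fuse the two passes."""

    def __init__(self, dim: int):
        super().__init__()
        self.fwd_ssm = Mamba1D(dim)
        self.bwd_ssm = Mamba1D(dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # (B, L, D)
        fwd = self.fwd_ssm(tokens)                  # left-to-right scan
        bwd = self.bwd_ssm(tokens.flip(dims=[1]))   # right-to-left scan
        bwd = bwd.flip(dims=[1])                    # restore original token order
        return self.norm(fwd + bwd)                 # fuse both directions (assumed)


class MultiModalConnector(nn.Module):
    """MMC: VSS over vision-encoder patch features, then a linear projection
    into the language model's embedding space."""

    def __init__(self, vis_dim: int, llm_dim: int):
        super().__init__()
        self.vss = VisionSelectiveScanBSM(vis_dim)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:  # (B, L, vis_dim)
        # Output (B, L, llm_dim) is concatenated with text embeddings
        # and fed to the Mamba language model.
        return self.proj(self.vss(patch_feats))


if __name__ == "__main__":
    # Example: 576 patch tokens (a 24x24 grid) from a ViT-style vision encoder.
    feats = torch.randn(2, 576, 1024)
    mmc = MultiModalConnector(vis_dim=1024, llm_dim=2048)
    print(mmc(feats).shape)  # torch.Size([2, 576, 2048])
```

The Cross-Scan Mechanism would differ only in the traversal pattern: instead of two passes over the flattened sequence, it scans the 2D token grid along multiple directions (e.g. row-wise and column-wise, forward and backward) before fusing, which is what gives the module its awareness of 2D spatial structure.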