Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference


5 Jun 2024 | Han Zhao, Min Zhang, Wei Zhao, Pengxiang Ding, Siteng Huang, Donglin Wang
Cobra is a novel multimodal large language model (MLLM) that addresses the computational inefficiency of current MLLMs, most of which are built on Transformer networks with quadratic complexity in sequence length. The model integrates the efficient Mamba language model with the visual modality and explores several modal fusion schemes to create an effective multimodal Mamba. Experiments show that Cobra achieves performance competitive with state-of-the-art methods of comparable scale, such as LLaVA-Phi, TinyLLaVA, and MobileVLM v2, while being significantly faster thanks to its linear sequence modeling. Notably, Cobra performs well at overcoming visual illusions and judging spatial relationships, and it matches LLaVA's performance with about 43% of the parameters. The paper also discusses Cobra's limitations, such as weaker performance on open-vocabulary question answering and the need for high numerical precision during inference. Overall, Cobra opens new possibilities for deploying high-performance AI models in environments that require high-frequency processing of visual information.
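To make the fusion idea concrete, below is a minimal PyTorch sketch of a Cobra-style architecture: patch features from a vision encoder are passed through an MLP projector into the language model's embedding space and prepended to the text token embeddings before the sequence model runs over the combined sequence. Everything here is an illustrative assumption rather than the paper's implementation: the class name, layer sizes, and especially the `nn.GRU` backbone are stand-ins (a real Cobra stacks Mamba selective-state-space blocks, which are likewise linear in sequence length).

```python
import torch
import torch.nn as nn

class CobraStyleVLM(nn.Module):
    """Hypothetical sketch of a Cobra-style multimodal model:
    visual features are projected into the LLM embedding space
    and prepended to the text tokens (one possible fusion scheme)."""

    def __init__(self, vision_dim=1024, d_model=512, vocab_size=32000):
        super().__init__()
        # Stand-in for a pretrained vision encoder (e.g. a ViT);
        # here just a linear map over precomputed patch features.
        self.vision_encoder = nn.Linear(vision_dim, vision_dim)
        # MLP projector that maps visual features into the language
        # model's embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )
        self.embed = nn.Embedding(vocab_size, d_model)
        # Placeholder backbone: a single GRU layer as a generic
        # linear-time recurrent stand-in for stacked Mamba blocks.
        self.backbone = nn.GRU(d_model, d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patch_feats, input_ids):
        # patch_feats: (B, N_patches, vision_dim); input_ids: (B, T)
        vis = self.projector(self.vision_encoder(patch_feats))
        txt = self.embed(input_ids)
        seq = torch.cat([vis, txt], dim=1)  # image tokens first
        hidden, _ = self.backbone(seq)
        return self.lm_head(hidden)         # next-token logits

# Toy usage: 16 image patches and 8 text tokens per example.
model = CobraStyleVLM()
logits = model(torch.randn(2, 16, 1024), torch.randint(0, 32000, (2, 8)))
print(logits.shape)  # torch.Size([2, 24, 32000])
```

The key point the sketch illustrates is why inference scales well: a recurrent state-space backbone processes the concatenated image-and-text sequence in time linear in its length, whereas a Transformer's self-attention would cost quadratically in the number of visual plus text tokens.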