Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference


5 Jun 2024 | Han Zhao, Min Zhang, Wei Zhao, Pengxiang Ding, Siteng Huang, Donglin Wang
Cobra is a multimodal large language model (MLLM) with linear computational complexity, designed to improve the efficiency of existing models that rely on the Transformer architecture and its quadratic computational complexity. The model integrates the efficient Mamba language model with the visual modality and explores several modal fusion schemes to build an effective multi-modal Mamba. Cobra achieves performance competitive with state-of-the-art computationally efficient methods such as LLaVA-Phi, TinyLLaVA, and MobileVLM v2, and is significantly faster thanks to its linear sequence modeling. It also performs well at overcoming visual illusions and judging spatial relationships, and even achieves performance comparable to LLaVA with about 43% of the parameters. Cobra is open-sourced to facilitate future research on complexity problems in MLLMs.

The model is built from three components: a vision encoder, a projector, and a Mamba backbone. The vision encoder combines DINOv2 and SigLIP, fusing the low-level spatial features from DINOv2 with the semantic features from SigLIP to improve performance on downstream tasks. The projector aligns the visual and text modalities by mapping the visual representation into the embedding dimension of the Mamba language model's tokens. The Mamba backbone is a stack of 64 identical basic blocks with residual connections and RMSNorm. The model receives the concatenation of the projected visual embeddings and the text embeddings and generates the target token sequence autoregressively (a code sketch of this pipeline appears after this summary).

Cobra is trained on a combined dataset of roughly 1.2 million images with corresponding multi-turn dialogue data, plus pure-text dialogue data. Training consists of two epochs of fine-tuning the entire LLM backbone and the projector.

The model is evaluated on six benchmarks: four open-ended visual question-answering tasks and two closed-set prediction tasks. The results show that Cobra outperforms previous methods of comparable scale built on Transformer backbones and runs 3-4 times faster than MobileVLM v2 3B and TinyLLaVA 3B. It also performs comparably to the much larger LLaVA v1.5 model with 7B parameters. Cobra's inference speed remains significantly higher than that of Transformer-based models even as the number of visual tokens increases, and its memory usage does not grow with the length of the visual token sequence, because RNN-like models maintain a constant-sized hidden state to store historical information during inference (a toy recurrence illustrating this constant-memory property appears below). On benchmarks including VQA, VSR, and POPE, Cobra shows a strong ability to capture spatial relationships and reduces hallucinations.

Cobra is competitive in the field of multimodal large language models, especially at processing visual information and generating natural-language descriptions. Some limitations remain, however, such as its performance on open-vocabulary question answering and its need for higher numerical precision during inference.
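The following is a minimal, hypothetical sketch of the three-component pipeline described above: a fused DINOv2 + SigLIP vision encoder, an MLP projector, and a Mamba backbone that decodes over the concatenated visual and text embeddings. Module names, dimensions, and the MLP structure are illustrative assumptions rather than the authors' exact implementation, and the Mamba language model is treated as a given module.

```python
# Hypothetical sketch of Cobra's pipeline (vision encoders -> projector -> Mamba LM).
# Names and shapes are assumptions for illustration; the real code may differ.
import torch
import torch.nn as nn

class FusedVisionEncoder(nn.Module):
    """Concatenates patch features from two frozen encoders (DINOv2 + SigLIP)."""
    def __init__(self, dinov2, siglip):
        super().__init__()
        self.dinov2 = dinov2   # assumed to return (B, N, d_dino) patch features
        self.siglip = siglip   # assumed to return (B, N, d_siglip) patch features

    def forward(self, images):
        f_dino = self.dinov2(images)   # low-level spatial features
        f_sig = self.siglip(images)    # semantic features
        return torch.cat([f_dino, f_sig], dim=-1)   # (B, N, d_dino + d_siglip)

class Projector(nn.Module):
    """MLP that maps fused visual features into the Mamba LM's embedding space."""
    def __init__(self, d_visual, d_model):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_visual, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, visual_features):
        return self.mlp(visual_features)   # (B, N, d_model)

class CobraSketch(nn.Module):
    """Prepends projected visual embeddings to text embeddings and feeds the
    concatenated sequence to the Mamba backbone for autoregressive decoding."""
    def __init__(self, vision_encoder, projector, mamba_lm, text_embedding):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.projector = projector
        self.mamba_lm = mamba_lm           # stack of Mamba blocks + LM head (placeholder)
        self.text_embedding = text_embedding

    def forward(self, images, input_ids):
        visual_emb = self.projector(self.vision_encoder(images))   # (B, N_v, d_model)
        text_emb = self.text_embedding(input_ids)                  # (B, N_t, d_model)
        inputs = torch.cat([visual_emb, text_emb], dim=1)          # (B, N_v + N_t, d_model)
        return self.mamba_lm(inputs)                               # next-token logits
```

Note the design choice this reflects: visual tokens are simply prepended to the text tokens, so the Mamba backbone processes one flat sequence and no cross-attention module is needed.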
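The constant-memory behavior comes from the recurrent view of state-space models: at each decoding step the model updates a fixed-size hidden state instead of appending to a growing key-value cache. The toy loop below illustrates this with a plain linear recurrence h_t = A h_{t-1} + B x_t, y_t = C h_t; it is not Mamba's selective-scan kernel, and all shapes and values are arbitrary.

```python
# Toy illustration (not the actual Mamba kernel): a linear recurrence keeps a
# fixed-size hidden state, so decoding memory does not grow with the number of
# visual or text tokens, unlike a Transformer's KV cache.
import torch

def ssm_decode_step(h, x_t, A, B, C):
    """One recurrent step: h_t = A h_{t-1} + B x_t, y_t = C h_t."""
    h = A @ h + B @ x_t          # hidden state keeps a constant shape
    y_t = C @ h
    return h, y_t

d_state, d_in = 16, 8
A = torch.eye(d_state) * 0.9     # arbitrary stable transition matrix
B = torch.randn(d_state, d_in)
C = torch.randn(d_in, d_state)

h = torch.zeros(d_state, 1)      # constant-sized state, independent of sequence length
for t in range(1000):            # 1000 tokens, but memory stays O(d_state)
    x_t = torch.randn(d_in, 1)
    h, y_t = ssm_decode_step(h, x_t, A, B, C)
```

Per-step memory is set by the state size, not by how many tokens have already been consumed, which is why Cobra's inference memory stays flat as the number of visual tokens increases.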