**MobileVLM V2: Faster and Stronger Baseline for Vision Language Model**
**Authors:** Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, Chunhua Shen
**Institution:** Meituan Inc., Zhejiang University, Dalian University of Technology
**Abstract:**
This paper introduces *MobileVLM V2*, an improved family of vision language models based on MobileVLM. The key contributions include novel architectural design, an improved training scheme tailored for mobile VLMs, and rich high-quality dataset curation. Specifically, MobileVLM V2 1.7B achieves better or on-par performance on standard VLM benchmarks compared to much larger VLMs at the 3B scale. Notably, the 3B model outperforms a variety of VLMs at the 7B+ scale. The models will be released at <https://github.com/Meituan-AutoML/MobileVLM>.
**Introduction:**
Vision language models (VLMs) have become crucial in AI research, integrating large language models (LLMs) with multi-modal features. However, challenges remain in deploying VLMs to real-world scenarios like mobile devices and self-driving cars. MobileVLM [15] explored the capacity of VLMs at the mobile scale with innovative hardware-oriented architectures. This paper builds upon MobileVLM, focusing on three main improvements: exploiting contributive training data, exploring effective training strategies, and renovating a high-performance lightweight projector.
**Contributions:**
1. Exploring and evaluating the performance impact of scaling up training data for small-scale VLMs.
2. Designing better training strategies for mobile scenarios and a novel training scheme to fully exploit high-quality multimodal data.
3. Achieving a new state-of-the-art tradeoff between performance and inference speed across several VLM benchmarks.
**Method:**
The MobileVLM V2 architecture consists of a pre-trained vision encoder, a pre-trained large language model, and a mobile-friendly projector. The key components include:
- **Vision Encoder:** Uses CLIP ViT-L/14 for extracting image features.
- **Language Model:** Employs MobileLLaMA for processing multi-modal tokens and generating answers.
- **Lightweight Downsample Projector (LDPv2):** Enhances vision-language feature alignment while reducing the number of visual tokens, using fewer parameters than MobileVLM's original projector (see the pipeline sketch after this list).
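A minimal sketch of the forward path described above, written in PyTorch. The module names, dimensions, and the LDPv2 internals (pointwise projections, 2×2 average pooling for token reduction, and a depthwise-convolution positional encoding with a skip connection) are assumptions for illustration, not the authors' exact definition.

```python
# Hypothetical sketch of the MobileVLM V2 vision-to-language pipeline.
import torch
import torch.nn as nn


class LDPv2(nn.Module):
    """Lightweight Downsample Projector (assumed structure): align vision
    features to the LLM embedding space while cutting the token count."""

    def __init__(self, vision_dim=1024, llm_dim=2048):
        super().__init__()
        # Feature transformation: two pointwise projections (assumption).
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        # Token reduction: 2x2 average pooling over the spatial grid,
        # e.g. 24x24 = 576 patch tokens -> 12x12 = 144 tokens (assumption).
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)
        # Positional information via a depthwise conv plus a skip connection.
        self.peg = nn.Conv2d(llm_dim, llm_dim, kernel_size=3, padding=1,
                             groups=llm_dim)

    def forward(self, vision_tokens):            # (B, N, vision_dim), N = H*W
        x = self.mlp(vision_tokens)               # (B, N, llm_dim)
        b, n, c = x.shape
        h = w = int(n ** 0.5)
        x = x.transpose(1, 2).reshape(b, c, h, w)
        x = self.pool(x)                          # (B, C, H/2, W/2)
        x = x + self.peg(x)                       # skip connection
        return x.flatten(2).transpose(1, 2)       # (B, N/4, llm_dim)


if __name__ == "__main__":
    # Toy shape check: a frozen CLIP-like encoder would emit 576 patch tokens;
    # the projector maps them to fewer LLM-width tokens, which are then
    # concatenated with text embeddings and fed to MobileLLaMA.
    projector = LDPv2(vision_dim=1024, llm_dim=2048)
    fake_patches = torch.randn(1, 576, 1024)      # stand-in for CLIP ViT-L/14 output
    visual_tokens = projector(fake_patches)
    print(visual_tokens.shape)                    # torch.Size([1, 144, 2048])
```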
**Training Strategy:**
The training process is split into two stages: pre-training and multi-task training. During pre-training, the projector and language model are fully trained, while the visual encoder is frozen. Multi-task training involves multiple vision-language tasks to enhance the model's capabilities.
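The two-stage recipe above amounts to a freezing policy plus a change of data mixture. The sketch below illustrates one way to express it; module names, learning rates, and the optimizer setup are placeholders, not the authors' configuration.

```python
# Hedged sketch of the two-stage training setup: vision encoder frozen,
# projector and language model fully trained in both stages.
import torch
import torch.nn as nn


def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = trainable


def build_optimizer(vision_encoder, projector, language_model, lr):
    set_trainable(vision_encoder, False)   # visual encoder stays frozen
    set_trainable(projector, True)         # projector fully trained
    set_trainable(language_model, True)    # language model fully trained
    trainable = list(projector.parameters()) + list(language_model.parameters())
    return torch.optim.AdamW(trainable, lr=lr)


if __name__ == "__main__":
    # Tiny stand-ins so the sketch runs; the real modules would be the CLIP
    # ViT-L/14 encoder, LDPv2, and MobileLLaMA.
    vision_encoder, projector, language_model = (nn.Linear(8, 8) for _ in range(3))

    # Stage 1: pre-training on image-caption data (illustrative lr).
    opt = build_optimizer(vision_encoder, projector, language_model, lr=1e-3)
    # Stage 2: multi-task training on a mixture of vision-language datasets
    # (illustrative lr; in practice typically restarted with a smaller value).
    opt = build_optimizer(vision_encoder, projector, language_model, lr=2e-5)
    print(any(p.requires_grad for p in vision_encoder.parameters()))  # False
```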
**Experiments:**
- **Performance Evaluation:** MobileVLM V2 achieves new state-of-the-art results with faster inference speed.
- **Latency Comparison:** Inference latency is compared against VLMs of similar and larger scale, confirming the favorable performance-speed tradeoff described above.