**MobileVLM V2: Faster and Stronger Baseline for Vision Language Model**
**Authors:** Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, Chunhua Shen
**Institution:** Meituan Inc., Zhejiang University, Dalian University of Technology
**Abstract:**
This paper introduces *MobileVLM V2*, an improved family of vision language models based on MobileVLM. The key contributions include novel architectural design, an improved training scheme tailored for mobile VLMs, and rich high-quality dataset curation. Specifically, MobileVLM V2 1.7B achieves better or on-par performance on standard VLM benchmarks compared to much larger VLMs at the 3B scale. Notably, the 3B model outperforms a variety of VLMs at the 7B+ scale. The models will be released at <https://github.com/Meituan-AutoML/MobileVLM>.
**Introduction:**
Vision language models (VLMs) have become crucial in AI research, integrating large language models (LLMs) with multi-modal features. However, challenges remain in deploying VLMs to real-world scenarios like mobile devices and self-driving cars. MobileVLM [15] explored the capacity of VLMs at the mobile scale with innovative hardware-oriented architectures. This paper builds upon MobileVLM, focusing on three main improvements: exploiting contributive training data, exploring effective training strategies, and renovating a high-performance lightweight projector.
**Contributions:**
1. Exploring and evaluating the performance impact of scaling up training data for small-scale VLMs.
2. Designing better training strategies for mobile scenarios and a novel training scheme to fully exploit high-quality multimodal data.
3. Achieving a new state-of-the-art tradeoff between performance and inference speed across several VLM benchmarks.
**Method:**
The MobileVLM V2 architecture consists of a pre-trained vision encoder, a pre-trained large language model, and a mobile-friendly projector. The key components include:
- **Vision Encoder:** Uses CLIP ViT-L/14 for extracting image features.
- **Language Model:** Employs MobileLLaMA for processing multi-modal tokens and generating answers.
- **Lightweight Downsample Projector (LDPv2):** Enhances vision-language feature alignment while reducing the number of visual tokens, using fewer parameters than MobileVLM's original projector (see the pipeline sketch after this list).
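A minimal sketch of the forward path described above, written in PyTorch. The module names, dimensions, and the LDPv2 internals (pointwise projections, 2×2 average pooling for token reduction, and a depthwise-convolution positional encoding with a skip connection) are assumptions for illustration, not the authors' exact definition.

```python
# Hypothetical sketch of the MobileVLM V2 vision-to-language pipeline.
import torch
import torch.nn as nn


class LDPv2(nn.Module):
    """Lightweight Downsample Projector (assumed structure): align vision
    features to the LLM embedding space while cutting the token count."""

    def __init__(self, vision_dim=1024, llm_dim=2048):
        super().__init__()
        # Feature transformation: two pointwise projections (assumption).
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        # Token reduction: 2x2 average pooling over the spatial grid,
        # e.g. 24x24 = 576 patch tokens -> 12x12 = 144 tokens (assumption).
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)
        # Positional information via a depthwise conv plus a skip connection.
        self.peg = nn.Conv2d(llm_dim, llm_dim, kernel_size=3, padding=1,
                             groups=llm_dim)

    def forward(self, vision_tokens):            # (B, N, vision_dim), N = H*W
        x = self.mlp(vision_tokens)               # (B, N, llm_dim)
        b, n, c = x.shape
        h = w = int(n ** 0.5)
        x = x.transpose(1, 2).reshape(b, c, h, w)
        x = self.pool(x)                          # (B, C, H/2, W/2)
        x = x + self.peg(x)                       # skip connection
        return x.flatten(2).transpose(1, 2)       # (B, N/4, llm_dim)


if __name__ == "__main__":
    # Toy shape check: a frozen CLIP-like encoder would emit 576 patch tokens;
    # the projector maps them to fewer LLM-width tokens, which are then
    # concatenated with text embeddings and fed to MobileLLaMA.
    projector = LDPv2(vision_dim=1024, llm_dim=2048)
    fake_patches = torch.randn(1, 576, 1024)      # stand-in for CLIP ViT-L/14 output
    visual_tokens = projector(fake_patches)
    print(visual_tokens.shape)                    # torch.Size([1, 144, 2048])
```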
**Training Strategy:**
The training process is split into two stages: pre-training and multi-task training. During pre-training, the projector and language model are fully trained, while the visual encoder is frozen. Multi-task training involves multiple vision-language tasks to enhance the model's capabilities.
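The two-stage recipe above amounts to a freezing policy plus a change of data mixture. The sketch below illustrates one way to express it; module names, learning rates, and the optimizer setup are placeholders, not the authors' configuration.

```python
# Hedged sketch of the two-stage training setup: vision encoder frozen,
# projector and language model fully trained in both stages.
import torch
import torch.nn as nn


def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = trainable


def build_optimizer(vision_encoder, projector, language_model, lr):
    set_trainable(vision_encoder, False)   # visual encoder stays frozen
    set_trainable(projector, True)         # projector fully trained
    set_trainable(language_model, True)    # language model fully trained
    trainable = list(projector.parameters()) + list(language_model.parameters())
    return torch.optim.AdamW(trainable, lr=lr)


if __name__ == "__main__":
    # Tiny stand-ins so the sketch runs; the real modules would be the CLIP
    # ViT-L/14 encoder, LDPv2, and MobileLLaMA.
    vision_encoder, projector, language_model = (nn.Linear(8, 8) for _ in range(3))

    # Stage 1: pre-training on image-caption data (illustrative lr).
    opt = build_optimizer(vision_encoder, projector, language_model, lr=1e-3)
    # Stage 2: multi-task training on a mixture of vision-language datasets
    # (illustrative lr; in practice typically restarted with a smaller value).
    opt = build_optimizer(vision_encoder, projector, language_model, lr=2e-5)
    print(any(p.requires_grad for p in vision_encoder.parameters()))  # False
```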
**Experiments:**
- **Performance Evaluation:** MobileVLM V2 achieves new state-of-the-art results with faster inference speed.
- **Latency Comparison:** Inference latency is compared against VLMs of similar and larger scale, confirming the favorable performance-speed tradeoff described above.