6 Feb 2024 | Xiangxiang Chu; Limeng Qiao; Xinyu Zhang; Shuang Xu; Fei Wei; Yang Yang; Xiaofei Sun; Yiming Hu; Xinyang Lin; Bo Zhang; Chunhua Shen
MobileVLM V2 is a significantly improved vision language model (VLM) built on MobileVLM, offering faster inference and stronger performance. It combines a novel architectural design, a training scheme tailored for mobile VLMs, and careful curation of rich, high-quality data. MobileVLM V2 1.7B matches or exceeds much larger models at the 3B scale on standard VLM benchmarks, and the 3B variant outperforms many 7B+ VLMs. The models are released at https://github.com/MeituanAutoML/MobileVLM.
Key improvements include utilizing high-quality image-text pairs for better alignment of vision-language features, incorporating diverse academic tasks to enhance instruction-following capacity, and developing a more efficient projector. The lightweight downsample projector (LDPv2) reduces the number of image tokens while maintaining performance. The model is trained with a comprehensive strategy that includes pre-training and multi-task training, allowing it to handle various vision-language tasks effectively.
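The summary above does not include the projector code, but the core idea of a downsample projector can be illustrated with a short sketch. The module below is an illustrative approximation, not the authors' released LDPv2 implementation: it assumes a ViT-style encoder that yields 576 patch tokens (a 24x24 grid), applies a per-token point-wise transformation into the LLM embedding width, cuts the token count 4x with 2x2 average pooling (576 to 144), and re-injects positional information with a depth-wise convolution and a residual connection. The names `DownsampleProjector`, `vision_dim`, and `llm_dim` are hypothetical.

```python
import torch
import torch.nn as nn

class DownsampleProjector(nn.Module):
    """Illustrative lightweight downsample projector (not the official LDPv2 code).

    Maps ViT patch features into the LLM embedding space while reducing the
    number of image tokens 4x via 2x2 average pooling.
    """

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 2048):
        super().__init__()
        # Point-wise feature transformation (applied per token).
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        # 2x2 average pooling on the spatial grid: 24x24 -> 12x12 tokens.
        self.pool = nn.AvgPool2d(kernel_size=2)
        # Depth-wise convolution as a simple positional-encoding generator.
        self.peg = nn.Conv2d(llm_dim, llm_dim, kernel_size=3, padding=1, groups=llm_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, vision_dim); num_tokens must form a square grid.
        b, n, _ = x.shape
        h = w = int(n ** 0.5)
        x = self.mlp(x)                              # (b, n, llm_dim)
        x = x.transpose(1, 2).reshape(b, -1, h, w)   # (b, llm_dim, h, w)
        x = self.pool(x)                             # (b, llm_dim, h/2, w/2)
        x = x + self.peg(x)                          # residual positional cue
        return x.flatten(2).transpose(1, 2)          # (b, n/4, llm_dim)

if __name__ == "__main__":
    tokens = torch.randn(1, 576, 1024)               # e.g. 24x24 patches from a ViT
    print(DownsampleProjector()(tokens).shape)       # torch.Size([1, 144, 2048])
```

Reducing the image tokens before they reach the language model is where most of the inference speedup comes from, since the LLM's cost grows with sequence length.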
MobileVLM V2 achieves a new state-of-the-art trade-off between performance and inference speed across several VLM benchmarks, and the 7B variant outperforms previous SOTA models by clear margins. Designed for resource-constrained scenarios, the models run efficiently on mobile devices and other edge environments, making them suitable for real-world applications such as smartphones, self-driving cars, and embodied AI systems.
The training strategy combines pre-training on a large dataset with multi-task training to broaden the model's capabilities. The architecture consists of a pre-trained vision encoder, a pre-trained large language model, and a lightweight projector that aligns vision and language features. Evaluated on a range of benchmarks, MobileVLM V2 shows significant improvements in both accuracy and inference speed over previous models, making it a strong baseline for future research and applications.
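To make the three-part architecture concrete, the sketch below shows one plausible way the components compose at inference time: encode the image, project its features into the LLM embedding space, prepend them to the text embeddings, and run the language model over the joint sequence. This is a minimal sketch with stand-in modules, not the released MobileVLM V2 code; `vlm_forward` and all module names are hypothetical.

```python
import torch
import torch.nn as nn

def vlm_forward(vision_encoder: nn.Module,
                projector: nn.Module,
                llm: nn.Module,
                embed_tokens: nn.Module,
                pixel_values: torch.Tensor,
                input_ids: torch.Tensor) -> torch.Tensor:
    """Compose the pretrained parts: encode the image, project its features
    into the LLM embedding space, concatenate them with the text embeddings,
    and run the language model over the joint sequence."""
    image_feats = vision_encoder(pixel_values)        # (b, n_img, vision_dim)
    image_embeds = projector(image_feats)             # (b, n_img, llm_dim)
    text_embeds = embed_tokens(input_ids)             # (b, n_txt, llm_dim)
    inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
    return llm(inputs_embeds)                         # logits over the joint sequence

if __name__ == "__main__":
    b, n_img, vdim, ldim, vocab = 1, 144, 1024, 2048, 32000
    vision_encoder = nn.Identity()                    # stand-in: features precomputed
    projector = nn.Linear(vdim, ldim)                 # stand-in for the lightweight projector
    embed_tokens = nn.Embedding(vocab, ldim)
    llm = nn.Linear(ldim, vocab)                      # stand-in per-token head
    out = vlm_forward(vision_encoder, projector, llm, embed_tokens,
                      torch.randn(b, n_img, vdim),
                      torch.randint(0, vocab, (b, 8)))
    print(out.shape)                                  # torch.Size([1, 152, 32000])
```

In this framing, only the projector is new; the vision encoder and language model are reused pretrained components, which is what keeps the approach lightweight.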