Small Language Model Meets with Reinforced Vision Vocabulary

23 Jan 2024 | Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, En Yu, Jianjian Sun, Chunrui Han, Xiangyu Zhang
This article introduces Vary-toy, a small-scale Large Vision Language Model (LVLM) that achieves competitive performance with larger models while remaining accessible to researchers with limited resources. Vary-toy is built on the Qwen-1.8B language model and incorporates an improved ("reinforced") vision vocabulary that strengthens its ability to understand and generate text from images. The vision vocabulary is generated from a combination of PDF image-text pairs and object detection data, which allows the model to encode visual information more effectively and improves its performance on tasks such as document understanding, image captioning, and visual question answering.

The model is evaluated on several benchmark datasets, including DocVQA, ChartQA, RefCOCO, and MMVet. Vary-toy achieves 65.6% ANLS on DocVQA, 59.1% accuracy on ChartQA, 88.1% accuracy on RefCOCO, and 29% on MMVet. These results show that, despite its smaller size, Vary-toy can match or exceed larger models such as Qwen-VL-7B and LLaVA-7B on a range of tasks. The article also discusses the challenges of training and deploying large LVLMs, such as high computational costs and the difficulty of training on consumer-grade GPUs. Vary-toy addresses these issues with a more efficient vision vocabulary generation process that leverages both dense text and natural object location data, allowing the capacity of the vocabulary network to be used more fully and improving its ability to encode visual information.
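The summary above describes an architecture in which a newly trained vision vocabulary network is paired with the original CLIP vocabulary, and the combined visual features are fed to a 1.8B-parameter language model. The sketch below is a minimal, hypothetical PyTorch illustration of that idea; the module names, dimensions, token counts, and the way the two branches' features are merged and projected into the LLM embedding space are assumptions made for illustration, not the authors' implementation (the actual model uses pretrained SAM-style and CLIP ViT encoders together with the Qwen-1.8B decoder).

```python
# Minimal sketch (not the authors' code) of a Vary-toy-style model that fuses
# image features from two vision vocabularies -- a "reinforced" vocabulary
# network and a CLIP-style branch -- before handing them to a small language model.
# Dimensions, token counts, and the linear projections are illustrative assumptions.

import torch
import torch.nn as nn


class ToyVisionBranch(nn.Module):
    """Placeholder vision encoder standing in for the vocabulary network or CLIP."""

    def __init__(self, out_dim: int, num_tokens: int):
        super().__init__()
        self.num_tokens = num_tokens
        # A tiny conv stem stands in for a ViT backbone's patch embedding.
        self.stem = nn.Conv2d(3, out_dim, kernel_size=16, stride=16)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.stem(images)                    # (B, C, H/16, W/16)
        feats = feats.flatten(2).transpose(1, 2)     # (B, N, C)
        return feats[:, : self.num_tokens]           # keep a fixed token budget


class VaryToyStyleModel(nn.Module):
    def __init__(self, llm_dim: int = 2048, vis_dim: int = 1024, num_tokens: int = 256):
        super().__init__()
        # Branch 1: the new vision vocabulary, trained on PDF text + detection data.
        self.vocab_branch = ToyVisionBranch(vis_dim, num_tokens)
        # Branch 2: placeholder for the CLIP branch supplying natural-image features.
        self.clip_branch = ToyVisionBranch(vis_dim, num_tokens)
        # Assumed design: linear projections into the LLM's embedding space.
        self.vocab_proj = nn.Linear(vis_dim, llm_dim)
        self.clip_proj = nn.Linear(vis_dim, llm_dim)
        # Stand-in for the Qwen-1.8B decoder: any causal LM with hidden size llm_dim.
        self.text_embed = nn.Embedding(32000, llm_dim)
        self.decoder = nn.TransformerDecoderLayer(d_model=llm_dim, nhead=8, batch_first=True)

    def forward(self, images: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        vocab_tokens = self.vocab_proj(self.vocab_branch(images))  # (B, N, llm_dim)
        clip_tokens = self.clip_proj(self.clip_branch(images))     # (B, N, llm_dim)
        # Merge the two vocabularies' tokens and prepend them to the text embeddings.
        image_tokens = torch.cat([vocab_tokens, clip_tokens], dim=1)
        text_tokens = self.text_embed(input_ids)
        hidden = torch.cat([image_tokens, text_tokens], dim=1)
        # Single decoder layer over the combined sequence (memory unused in this stub).
        return self.decoder(hidden, hidden)


if __name__ == "__main__":
    model = VaryToyStyleModel()
    images = torch.randn(1, 3, 1024, 1024)
    input_ids = torch.randint(0, 32000, (1, 16))
    out = model(images, input_ids)
    print(out.shape)  # torch.Size([1, 528, 2048]) = 2 * 256 image tokens + 16 text tokens
```

The key point the sketch tries to convey is the one made in the article: instead of enlarging the language model, the visual side is strengthened, so a 1.8B LLM receives richer image tokens (dense text plus object-level information) than a CLIP-only pipeline would provide.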
The authors conclude that Vary-toy is not just a toy model but a promising baseline for LVLM research, especially for researchers with limited resources. They believe that Vary-toy has significant potential for further improvement and could serve as a practical foundation for future research on vision-language models. The code for Vary-toy is publicly available, making it easy for researchers to experiment with and build upon.