30 May 2024 | Yihe Deng, Pan Lu, Fan Yin, Ziniu Hu, Sheng Shen, James Zou, Kai-Wei Chang, Wei Wang
This paper introduces STIC, a two-stage self-training approach for enhancing the image comprehension capabilities of large vision-language models (LVLMs). STIC leverages self-generated data to improve performance without relying on pre-labeled image information. The first stage constructs a preference dataset for image descriptions from unlabeled images alone, generating preferred responses through well-designed prompts and dispreferred responses through corrupted images. The second stage infuses the self-generated image descriptions into instruction-tuning data to further refine the model's reasoning abilities.
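The first stage amounts to a simple self-generation loop over unlabeled images. The sketch below illustrates one way such a preference dataset could be built, assuming the base LVLM is available as a callable `lvlm(image, prompt) -> str`; the prompt wording, the Gaussian-blur corruption, and the choice between a misleading prompt and a corrupted image for the dispreferred branch are illustrative assumptions rather than the paper's exact recipe.

```python
# Sketch of STIC stage 1: self-generating an image-description preference
# dataset from unlabeled images only. The prompts, the blur corruption, and
# the `lvlm` interface are assumptions for illustration, not the paper's code.
import random
from PIL import Image, ImageFilter

GOOD_PROMPT = (
    "Describe the image in detail, covering the objects present, "
    "their attributes, and the overall scene layout."
)
BAD_PROMPTS = [  # hypothetical prompts meant to elicit low-quality descriptions
    "Describe the image, including objects that could plausibly be there even if you cannot see them.",
    "Give a very brief and vague description of the image.",
]

def corrupt_image(image: Image.Image) -> Image.Image:
    # One possible corruption (heavy blur); other corruptions would work too.
    return image.filter(ImageFilter.GaussianBlur(radius=10))

def build_preference_pair(lvlm, image: Image.Image) -> dict:
    # Preferred response: the well-designed prompt applied to the original image.
    chosen = lvlm(image, GOOD_PROMPT)

    # Dispreferred response: either a misleading/vague prompt on the original
    # image, or the good prompt on a corrupted copy of the image.
    if random.random() < 0.5:
        rejected = lvlm(image, random.choice(BAD_PROMPTS))
    else:
        rejected = lvlm(corrupt_image(image), GOOD_PROMPT)

    return {"image": image, "prompt": GOOD_PROMPT, "chosen": chosen, "rejected": rejected}

def build_preference_dataset(lvlm, unlabeled_images) -> list:
    # Stage 1 output: preference data produced entirely by the model itself,
    # with no human labels or pre-annotated image information.
    return [build_preference_pair(lvlm, img) for img in unlabeled_images]
```

The resulting chosen/rejected pairs feed a preference-optimization step in stage 1, and the preferred self-generated descriptions are what stage 2 infuses into the instruction-tuning data.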
The method is validated across seven vision-language benchmarks, demonstrating an average accuracy gain of 4.0% while using 70% less supervised fine-tuning data than the existing approach. STIC achieves notable improvements on tasks such as ScienceQA, where it gains 6.4%. The results also highlight the importance of dispreferred samples in aligning preferences and improving model performance.
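To make the role of dispreferred samples concrete, a standard DPO-style preference objective (shown here purely as an illustration; the paper's exact training objective may differ or include additional regularization) explicitly pushes the model away from the dispreferred description while pulling it toward the preferred one:

$$
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],
$$

where $x$ is the image together with the description prompt, $y_w$ and $y_l$ are the self-generated preferred and dispreferred descriptions, $\pi_{\mathrm{ref}}$ is the frozen base LVLM, $\sigma$ is the sigmoid function, and $\beta$ controls the strength of the preference margin. Dropping the $y_l$ term removes the explicit penalty on low-quality descriptions, which is consistent with the observed importance of dispreferred samples.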
Experiments show that STIC can effectively leverage large amounts of unlabeled image data, making it a cost-effective way to enhance LVLMs. The method is also scalable: performance gains grow as the amount of preference data increases. Ablation studies further confirm that including dispreferred samples and self-generated descriptions significantly improves model performance.
The paper also discusses the correlation between image distribution and performance gains, showing that benchmarks whose image distributions are similar to the MSCOCO dataset, from which the self-training images are drawn, see larger gains. Overall, STIC provides a novel and effective approach to self-training for LVLMs, enhancing their image comprehension and reasoning capabilities.