30 May 2024 | Yihe Deng, Pan Lu, Fan Yin, Ziniu Hu, Sheng Shen, James Zou, Kai-Wei Chang, Wei Wang
This paper introduces STIC, a two-stage self-training approach for enhancing the image comprehension capabilities of large vision-language models (LVLMs). STIC leverages self-generated data to improve performance without relying on pre-labeled image information. The first stage constructs a preference dataset for image descriptions from unlabeled images alone, generating preferred responses through well-designed prompts and dispreferred responses through corrupted images. The second stage infuses the self-generated image descriptions into instruction-tuning data to further refine the model's reasoning abilities.
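The first stage amounts to a simple self-generation loop over unlabeled images. The sketch below illustrates one way such a preference dataset could be built, assuming the base LVLM is available as a callable `lvlm(image, prompt) -> str`; the prompt wording, the Gaussian-blur corruption, and the choice between a misleading prompt and a corrupted image for the dispreferred branch are illustrative assumptions rather than the paper's exact recipe.

```python
# Sketch of STIC stage 1: self-generating an image-description preference
# dataset from unlabeled images only. The prompts, the blur corruption, and
# the `lvlm` interface are assumptions for illustration, not the paper's code.
import random
from PIL import Image, ImageFilter

GOOD_PROMPT = (
    "Describe the image in detail, covering the objects present, "
    "their attributes, and the overall scene layout."
)
BAD_PROMPTS = [  # hypothetical prompts meant to elicit low-quality descriptions
    "Describe the image, including objects that could plausibly be there even if you cannot see them.",
    "Give a very brief and vague description of the image.",
]

def corrupt_image(image: Image.Image) -> Image.Image:
    # One possible corruption (heavy blur); other corruptions would work too.
    return image.filter(ImageFilter.GaussianBlur(radius=10))

def build_preference_pair(lvlm, image: Image.Image) -> dict:
    # Preferred response: the well-designed prompt applied to the original image.
    chosen = lvlm(image, GOOD_PROMPT)

    # Dispreferred response: either a misleading/vague prompt on the original
    # image, or the good prompt on a corrupted copy of the image.
    if random.random() < 0.5:
        rejected = lvlm(image, random.choice(BAD_PROMPTS))
    else:
        rejected = lvlm(corrupt_image(image), GOOD_PROMPT)

    return {"image": image, "prompt": GOOD_PROMPT, "chosen": chosen, "rejected": rejected}

def build_preference_dataset(lvlm, unlabeled_images) -> list:
    # Stage 1 output: preference data produced entirely by the model itself,
    # with no human labels or pre-annotated image information.
    return [build_preference_pair(lvlm, img) for img in unlabeled_images]
```

The resulting chosen/rejected pairs feed a preference-optimization step in stage 1, and the preferred self-generated descriptions are what stage 2 infuses into the instruction-tuning data.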
The method is validated across seven vision-language benchmarks, demonstrating an average accuracy gain of 4.0% while using 70% less supervised fine-tuning data than the existing approach. STIC achieves notable improvements on tasks such as ScienceQA, where it gains 6.4%. The results also highlight the importance of dispreferred samples in aligning preferences and improving model performance.
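To make the role of dispreferred samples concrete, a standard DPO-style preference objective (shown here purely as an illustration; the paper's exact training objective may differ or include additional regularization) explicitly pushes the model away from the dispreferred description while pulling it toward the preferred one:

$$
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],
$$

where $x$ is the image together with the description prompt, $y_w$ and $y_l$ are the self-generated preferred and dispreferred descriptions, $\pi_{\mathrm{ref}}$ is the frozen base LVLM, $\sigma$ is the sigmoid function, and $\beta$ controls the strength of the preference margin. Dropping the $y_l$ term removes the explicit penalty on low-quality descriptions, which is consistent with the observed importance of dispreferred samples.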
Experiments show that STIC can effectively leverage large amounts of unlabeled image data, making it a cost-effective way to enhance LVLMs. The method is also scalable: performance gains grow as the amount of preference data increases. Ablation studies further confirm that including dispreferred samples and self-generated descriptions significantly improves model performance.
The paper also discusses the correlation between image distribution and performance gains, showing that benchmarks whose image distributions are similar to the MSCOCO dataset, from which the self-training images are drawn, see larger gains. Overall, STIC provides a novel and effective approach to self-training for LVLMs, enhancing their image comprehension and reasoning capabilities.