ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models

17 Jun 2024 | Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen*, Jianquan Li, Xiang Wan, Benyou Wang*
The paper "ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models" addresses the challenge of training lightweight vision-language models (LVLMs) with high-quality data to bridge the performance gap between traditional-scale LVLMs and resource-friendly lite versions. The authors propose a comprehensive pipeline for generating a synthetic dataset, leveraging strong proprietary models to create fine-grained image annotations and complex reasoning visual question-answering pairs. The dataset, named ALLaVA, consists of 1.3 million samples and is trained on a series of lite VLMs, demonstrating competitive performance on 17 benchmarks among 4B LVLMs and even matching the performance of 7B/13B-scale models on various benchmarks. The paper highlights the feasibility of using high-quality data to enhance the efficiency and performance of LVLMs, making them more accessible and widely applicable. The dataset and models are open-sourced to the research community to foster further development and improvement in LVLMs.The paper "ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models" addresses the challenge of training lightweight vision-language models (LVLMs) with high-quality data to bridge the performance gap between traditional-scale LVLMs and resource-friendly lite versions. The authors propose a comprehensive pipeline for generating a synthetic dataset, leveraging strong proprietary models to create fine-grained image annotations and complex reasoning visual question-answering pairs. The dataset, named ALLaVA, consists of 1.3 million samples and is trained on a series of lite VLMs, demonstrating competitive performance on 17 benchmarks among 4B LVLMs and even matching the performance of 7B/13B-scale models on various benchmarks. The paper highlights the feasibility of using high-quality data to enhance the efficiency and performance of LVLMs, making them more accessible and widely applicable. The dataset and models are open-sourced to the research community to foster further development and improvement in LVLMs.