What If We Recaption Billions of Web Images with LLaMA-3?


18 Jun 2024 | Xianhang Li, Haoqin Tu, Mude Hui, Zeyu Wang, Bingchen Zhao, Junfei Xiao, Sucheng Ren, Jieru Mei, Qing Liu, Huangjie Zheng, Yuyin Zhou, Cihang Xie
This paper introduces Recap-DataComp-1B, a large-scale image-text dataset produced by recaptioning the roughly 1.3 billion web-crawled images of DataComp-1B with a LLaMA-3-powered LLaVA model. The authors fine-tune a LLaMA-3-8B-based LLaVA model and use it to generate more detailed and accurate textual descriptions than the original web-crawled alt-text, yielding image-text pairs with stronger semantic alignment. Training on Recap-DataComp-1B improves vision-language models in both discriminative and generative settings: CLIP models trained on the recaptioned data show significant gains in zero-shot cross-modal retrieval, while text-to-image generation models trained on it follow user-provided text instructions more faithfully and produce higher-quality images. These results position Recap-DataComp-1B as a valuable resource for further research and development in vision-language modeling.
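Below is a minimal sketch of what the recaptioning step could look like with a LLaVA-style model served through Hugging Face transformers. The checkpoint ID and the captioning prompt are illustrative assumptions, not necessarily the exact ones used by the authors; the paper's actual pipeline runs a fine-tuned LLaMA-3-8B-powered LLaVA model at billion-image scale.

```python
# Hedged sketch: recaption a single image with a LLaVA-style model.
# MODEL_ID and the prompt wording are assumptions for illustration only.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaNextForConditionalGeneration

MODEL_ID = "llava-hf/llama3-llava-next-8b-hf"  # hypothetical checkpoint name

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaNextForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def recaption(image_path: str,
              prompt: str = "Please generate a detailed caption of this image.") -> str:
    """Generate a detailed caption for one image."""
    image = Image.open(image_path).convert("RGB")
    conversation = [
        {"role": "user",
         "content": [{"type": "image"}, {"type": "text", "text": prompt}]},
    ]
    text = processor.apply_chat_template(conversation, add_generation_prompt=True)
    inputs = processor(images=image, text=text, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    # Decode only the newly generated tokens, not the prompt.
    caption = processor.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return caption.strip()

if __name__ == "__main__":
    print(recaption("example.jpg"))
```

In practice the dataset-scale version of this loop would be batched and distributed across many GPUs; the single-image function above is only meant to show the shape of the recaptioning call.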