18 Jun 2024 | Xianhang Li, Haoqin Tu, Mude Hui, Zeyu Wang, Bingchen Zhao, Junfei Xiao, Sucheng Ren, Jieru Mei, Qing Liu, Huangjie Zheng, Yuyin Zhou, Cihang Xie
This paper introduces Recap-DataComp-1B, a large-scale image dataset paired with detailed textual descriptions generated by a LLaMA-3-powered LLaVA model. The authors aim to improve the quality of web-crawled image-text pairs, which often suffer from misalignment and a lack of detail. Using a fine-tuned LLaMA-3-8B-powered LLaVA model to recaption the DataComp-1B dataset, they create a recaptioned dataset that offers substantial benefits for training advanced vision-language models. Empirical results show that Recap-DataComp-1B significantly improves CLIP's zero-shot performance on cross-modal retrieval tasks and strengthens the alignment between generated images and text instructions in text-to-image generative models. The paper also provides a detailed analysis of the recaptioned content, demonstrating improved word distribution, caption length, and semantic quality. Additionally, the authors evaluate how different mix ratios between original and recaptioned captions affect CLIP performance, as well as the effect of larger text encoders. Finally, they assess the quality of text-to-image generation models trained on Recap-DataComp-1B, showing significant improvements in both FID and CLIP scores. The project page for Recap-DataComp-1B is available at <https://www.haqtu.me/Recap-Datacomp-1B>.
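As a rough illustration of the mix-ratio experiment described above, the sketch below shows one simple way to interleave original web captions with recaptions when building the text side of a CLIP training batch. The function name, the example captions, and the 0.5 ratio are illustrative assumptions for this note, not the authors' actual pipeline.

```python
import random

def mix_captions(original_caption: str, recaption: str, recaption_ratio: float = 0.5) -> str:
    """Randomly pick the original web caption or the model-generated recaption.

    `recaption_ratio` is the probability of using the recaption; sweeping this
    ratio is how one would probe its effect on CLIP's zero-shot retrieval
    performance, as the paper does (values here are illustrative).
    """
    return recaption if random.random() < recaption_ratio else original_caption

# Hypothetical (original caption, recaption) pairs for one training batch.
batch = [
    ("a photo", "A golden retriever puppy lying on green grass in afternoon sunlight."),
    ("img_123.jpg", "A red vintage bicycle leaning against an ivy-covered brick wall."),
]
texts = [mix_captions(orig, recap, recaption_ratio=0.5) for orig, recap in batch]
print(texts)
```

A per-sample random choice like this keeps every image in each epoch while controlling, in expectation, what fraction of the text supervision comes from the detailed recaptions versus the noisier original alt-text.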