25 Mar 2024 | Kecheng Zheng*, Yifei Zhang*, Wei Wu, Fan Lu, Shuailei Ma, Xin Jin, Wei Chen, and Yujun Shen
DreamLIP is a language-image pre-training method that leverages long captions to enhance the representation capacity of vision-language models. The method re-captions 30 million images with a pre-trained multi-modal large language model (MLLM) to generate detailed long captions. These captions are then used in a contrastive learning framework: sub-captions sampled from each long caption form multiple positive pairs with the image, and a subcaption-specific grouping loss aligns each sub-caption with its corresponding image patches.

The approach demonstrates superior performance on a range of downstream tasks, including image-text retrieval, semantic segmentation, and image understanding, often matching or exceeding CLIP, which is trained on a much larger dataset. Long captions allow the model to capture more detailed and nuanced information, leading to better fine-grained representations.

Ablation studies evaluate the impact of individual components, such as short captions, long captions, and the subcaption-specific grouping loss; the results indicate that long captions significantly enhance the model's ability to understand and represent visual information. The study also examines how the choice of MLLM affects caption generation and shows that combining captions from multiple MLLMs leads to better performance. Overall, DreamLIP represents a promising advancement in vision-language pre-training by effectively utilizing long captions to improve model performance and representation capacity.
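To make the multi-positive contrastive objective concrete, the sketch below (plain PyTorch, not the authors' released code) shows one way to treat several sub-caption embeddings of the same image as positives in an InfoNCE-style loss. The tensor shapes, the function name, the temperature value, and the averaging over sub-captions are illustrative assumptions rather than details taken from the paper.

    # Minimal sketch of a multi-positive image-text contrastive loss.
    # Assumptions (not from the paper): B images, K sub-captions per image,
    # embeddings already L2-normalized, temperature = 0.07.
    import torch
    import torch.nn.functional as F

    def multi_positive_contrastive_loss(img_emb, txt_emb, temperature=0.07):
        """img_emb: (B, D) image embeddings; txt_emb: (B, K, D) sub-caption embeddings."""
        B, K, D = txt_emb.shape
        txt_flat = txt_emb.reshape(B * K, D)                # (B*K, D)
        logits = img_emb @ txt_flat.t() / temperature       # (B, B*K) similarity matrix
        # Each image's K sub-captions are positives; all other texts are negatives.
        targets = torch.arange(B, device=img_emb.device).repeat_interleave(K)
        # Image-to-text direction: average the loss over the K positives per image.
        log_prob = F.log_softmax(logits, dim=1)
        pos_mask = targets.unsqueeze(0) == torch.arange(B, device=img_emb.device).unsqueeze(1)
        loss_i2t = -(log_prob * pos_mask).sum(dim=1) / K
        # Text-to-image direction: each sub-caption has exactly one positive image.
        loss_t2i = F.cross_entropy(logits.t(), targets, reduction='none')
        return (loss_i2t.mean() + loss_t2i.mean()) / 2

The subcaption-specific grouping loss described in the paper (aligning each sub-caption with its corresponding image patches) would add a second, finer-grained term on top of this image-level objective; it is omitted here for brevity.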