**DreamLIP: Language-Image Pre-training with Long Captions**
**Authors:** Kecheng Zheng, Yifei Zhang, Wei Wu, Fan Lu, Shuailei Ma, Xin Jin, Wei Chen, and Yujun Shen
**Abstract:**
Language-image pre-training heavily relies on the precision and thoroughness of text descriptions for paired images. However, images often contain rich content that requires lengthy captions (e.g., 10 sentences) to describe adequately, which are typically missing in existing datasets. This paper addresses the question of whether and how long captions can benefit language-image pre-training. To explore this, the authors re-caption 30 million images using a pre-trained Multi-modality Large Language Model (MLLM) and study the usage of these captions under a contrastive learning framework. They observe that each sentence in a long caption likely describes a partial aspect of the image. Inspired by this, they propose to dynamically sample sub-captions from the text label to construct multiple positive pairs and introduce a grouping loss to match the embeddings of each sub-caption with its corresponding local image patches in a self-supervised manner. Experimental results on various downstream tasks demonstrate the superior performance of their method, named DreamLIP, over previous alternatives, highlighting its fine-grained representational capacity. Notably, DreamLIP trained with 30 million image-text pairs achieves comparable or even better performance than CLIP trained with 400 million pairs on tasks such as image-text retrieval and semantic segmentation.
**Keywords:** Language-image pre-training, Long caption, Multi-modal learning.
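
The abstract describes sampling multiple sub-captions per image so that one image has several positive texts in the contrastive objective. The sketch below is a minimal, hypothetical illustration of such a multi-positive contrastive loss; the function name, tensor shapes, and soft-target formulation are assumptions for illustration and are not taken from the DreamLIP paper or its released code (the paper's grouping loss over local patches is not reproduced here).

```python
# Hypothetical sketch of a multi-positive sub-caption contrastive loss.
# All names here are illustrative, not the authors' implementation.
import torch
import torch.nn.functional as F


def sub_caption_contrastive_loss(image_emb, text_emb, image_ids, temperature=0.07):
    """Contrastive loss where several sub-captions of the same image are all positives.

    image_emb: (B, D) one embedding per image in the batch
    text_emb:  (M, D) embeddings of sub-captions sampled from the long captions
    image_ids: (M,)   index telling which batch image each sub-caption describes
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity between every sub-caption and every image: (M, B)
    logits = text_emb @ image_emb.t() / temperature

    # Text-to-image direction: each sub-caption has exactly one positive image.
    loss_t2i = F.cross_entropy(logits, image_ids)

    # Image-to-text direction: each image may have several positive sub-captions,
    # so spread the target probability mass uniformly over all of them.
    targets = torch.zeros_like(logits.t())                      # (B, M)
    targets[image_ids, torch.arange(len(image_ids))] = 1.0
    targets = targets / targets.sum(dim=1, keepdim=True).clamp(min=1.0)
    loss_i2t = -(targets * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()

    return 0.5 * (loss_t2i + loss_i2t)


if __name__ == "__main__":
    B, D = 4, 512
    image_emb = torch.randn(B, D)
    image_ids = torch.tensor([0, 0, 1, 1, 1, 2, 3, 3])  # e.g. 2-3 sub-captions per image
    text_emb = torch.randn(len(image_ids), D)
    print(sub_caption_contrastive_loss(image_emb, text_emb, image_ids).item())
```

In this toy formulation, the text-to-image term is a standard InfoNCE loss, while the image-to-text term uses soft targets because an image can match several sampled sub-captions at once; other weightings of the positives would also be plausible.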