Long-CLIP: Unlocking the Long-Text Capability of CLIP

22 Jul 2024 | Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, and Jiaqi Wang
Long-CLIP is a modified version of the Contrastive Language-Image Pre-training (CLIP) model that enables it to handle long text inputs. CLIP has been widely used for zero-shot classification, text-image retrieval, and text-to-image generation by aligning the image and text modalities. However, CLIP's text input is limited to 77 tokens, which restricts its ability to process detailed descriptions.

Long-CLIP addresses this limitation by supporting long-text input while retaining, and in some cases surpassing, CLIP's zero-shot generalizability. It achieves this through two novel strategies: (1) knowledge-preserved stretching of the positional embeddings and (2) primary component matching of CLIP features. Trained on just one million long text-image pairs, Long-CLIP improves long-caption text-image retrieval by roughly 20% and traditional text-image retrieval by roughly 6%. It can also enhance image generation from detailed text descriptions by replacing CLIP in a plug-and-play manner. The model is released at https://github.com/beichenzbc/Long-CLIP.

Because Long-CLIP remains aligned with the CLIP latent space, it can replace CLIP in downstream frameworks without further adaptation. Its ability to handle long texts benefits both image-text retrieval and text-to-image generation, and it matches or surpasses CLIP on both short- and long-text tasks with no degradation in zero-shot classification. Its effectiveness is demonstrated through experiments on various datasets, including the newly created Urban-1k dataset.
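To give a concrete sense of the first strategy, the sketch below shows one way to stretch CLIP's positional embedding table while preserving the most heavily trained positions: the first few embeddings are kept untouched and only the remaining ones are interpolated to a longer length. The specific numbers (`keep_len=20`, a 4x interpolation) and the function name are illustrative assumptions for this summary, not a verbatim copy of the released Long-CLIP code.

```python
import torch
import torch.nn.functional as F

def stretch_positional_embedding(pos_emb: torch.Tensor,
                                 keep_len: int = 20,
                                 stretch_ratio: int = 4) -> torch.Tensor:
    """Knowledge-preserved stretching (sketch, assumed parameters).

    Keeps the first `keep_len` positional embeddings as-is (these positions
    see the most training signal under CLIP's 77-token limit) and linearly
    interpolates the remaining ones to `stretch_ratio` times their original
    count, so the text encoder can accept much longer inputs.
    """
    kept = pos_emb[:keep_len]          # well-trained positions, left untouched
    rest = pos_emb[keep_len:]          # positions to be stretched

    # F.interpolate expects (N, C, L): treat the embedding dim as channels.
    rest = rest.t().unsqueeze(0)                       # (1, d, n)
    new_len = rest.shape[-1] * stretch_ratio
    stretched = F.interpolate(rest, size=new_len,
                              mode="linear", align_corners=True)
    stretched = stretched.squeeze(0).t()               # back to (new_len, d)

    return torch.cat([kept, stretched], dim=0)

# Example: a CLIP-sized table of 77 positions (width 512) becomes 20 + 57*4 = 248.
if __name__ == "__main__":
    pos_emb = torch.randn(77, 512)
    print(stretch_positional_embedding(pos_emb).shape)  # torch.Size([248, 512])
```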
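The second strategy, primary component matching, is commonly read as pairing the long, detailed caption with the full fine-grained image feature while pairing a short summary caption with a coarse-grained image feature obtained by keeping only the leading components of the embedding. The sketch below follows that reading; the PCA-style decomposition, the choice of `k`, the function names, and the exact loss arrangement are assumptions made for illustration, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def coarse_grained_feature(img_feats: torch.Tensor, k: int = 32) -> torch.Tensor:
    """Sketch of primary-component extraction (assumed PCA-style decomposition).

    Projects a batch of image embeddings onto their top-k principal directions
    and reconstructs them, discarding fine-grained detail while keeping the
    dominant semantics. `k` is an illustrative choice.
    """
    mean = img_feats.mean(dim=0, keepdim=True)
    centered = img_feats - mean
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)  # rows of vh: principal directions
    top = vh[:k]                                                # (k, d)
    coarse = centered @ top.t() @ top + mean                    # project and reconstruct
    return F.normalize(coarse, dim=-1)

def primary_component_matching_loss(img_feats, long_txt_feats, short_txt_feats,
                                    logit_scale: float = 100.0):
    """Two contrastive terms (sketch): long caption vs. full image feature,
    short caption vs. coarse (primary-component) image feature.
    Text features are assumed to be L2-normalized already."""
    labels = torch.arange(img_feats.shape[0], device=img_feats.device)
    fine = F.normalize(img_feats, dim=-1)
    coarse = coarse_grained_feature(img_feats)

    loss_long = F.cross_entropy(logit_scale * fine @ long_txt_feats.t(), labels)
    loss_short = F.cross_entropy(logit_scale * coarse @ short_txt_feats.t(), labels)
    return loss_long + loss_short
```

The intended effect of such a split is that the model learns to absorb long, attribute-rich captions without drifting away from the coarse, CLIP-like alignment that short prompts and zero-shot classification rely on.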