22 Jul 2024 | Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Jiaqi Wang
**Long-CLIP: Unlocking the Long-Text Capability of CLIP**
**Authors:** Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Jiaqi Wang
**Institutions:** Shanghai AI Laboratory, Shanghai Jiao Tong University, The Chinese University of Hong Kong
**GitHub:** https://github.com/beichenzbc/Long-CLIP
**Abstract:**
CLIP, a cornerstone for zero-shot classification, text-image retrieval, and text-to-image generation, is limited by its 77-token text input length. This paper introduces Long-CLIP, a plug-and-play alternative that supports long-text input while retaining or surpassing CLIP's zero-shot generalizability. Long-CLIP addresses the limitations of CLIP through two novel strategies: knowledge-preserved stretching of positional embedding and primary component matching of CLIP features. With just 1 million extra long text-image pairs, Long-CLIP outperforms CLIP by 20% in long caption text-image retrieval and 6% in traditional text-image retrieval tasks. It also enhances image generation from detailed text descriptions.
**Keywords:** Multimodality, Zero-shot Image Classification, Text-Image Retrieval, Text-to-Image Generation
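Since Long-CLIP is positioned as a plug-and-play replacement for CLIP, a typical workflow mirrors the original `clip` package. The snippet below is a sketch assuming the repository exposes a CLIP-like interface (`longclip.load`, `longclip.tokenize`, `encode_image`, `encode_text`) and a released checkpoint such as `longclip-B.pt`; consult the GitHub repository above for the exact entry points.

```python
import torch
from PIL import Image
from model import longclip  # import path assumed from the Long-CLIP repository

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = longclip.load("./checkpoints/longclip-B.pt", device=device)

# Long captions (well beyond CLIP's ~20-token effective length) are accepted directly.
captions = [
    "A man in a dark blue jacket is crossing a rain-soaked street at dusk, "
    "with a red car parked near a row of small shops and a dog waiting by the curb.",
    "A close-up photo of a cat sleeping on a wooden table next to a cup of coffee.",
]
text = longclip.tokenize(captions).to(device)
image = preprocess(Image.open("demo.png")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(probs)  # similarity of the image to each long caption
```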
**Introduction:**
CLIP's text encoder uses an absolute positional embedding limited to 77 tokens, restricting its ability to handle detailed descriptions. This paper proposes Long-CLIP to address this limitation by relaxing the input text length and fine-tuning CLIP with long text-image pairs. The key contributions include:
1. **Knowledge-Preserved Stretching:** Retains the well-trained positional embeddings for the first 20 tokens and interpolates the remaining positions with a larger ratio (a minimal sketch follows this list).
2. **Primary Component Matching:** Aligns fine-grained and coarse-grained image features with long and short captions, respectively.
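The stretching idea in the first contribution can be made concrete with a short sketch. The function below assumes CLIP's text positional embedding is a `[77, d]` tensor and uses linear interpolation; the kept boundary (20 positions) and stretching ratio (4x, giving 248 positions) follow the paper's description, but the function name and implementation details are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def knowledge_preserved_stretch(pos_emb: torch.Tensor,
                                keep: int = 20,
                                ratio: int = 4) -> torch.Tensor:
    """Stretch a [77, d] positional embedding to a longer sequence.

    The first `keep` positions (the well-trained ones) are copied unchanged;
    the remaining 77 - keep positions are linearly interpolated by `ratio`.
    With keep=20 and ratio=4 this yields 20 + 57 * 4 = 248 positions.
    """
    kept = pos_emb[:keep]                                  # [keep, d], untouched
    rest = pos_emb[keep:].t().unsqueeze(0)                 # [1, d, 77 - keep]
    stretched = F.interpolate(rest, scale_factor=ratio,
                              mode="linear", align_corners=True)
    stretched = stretched.squeeze(0).t()                   # [(77 - keep) * ratio, d]
    return torch.cat([kept, stretched], dim=0)             # [248, d] for the defaults

# Example: stretch a randomly initialized embedding of CLIP's shape.
old_pos_emb = torch.randn(77, 512)
new_pos_emb = knowledge_preserved_stretch(old_pos_emb)
print(new_pos_emb.shape)  # torch.Size([248, 512])
```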
**Methods:**
- **Exploring the Effective Length of CLIP:** Empirical studies show that CLIP's effective text length is only 20 tokens.
- **Knowledge-Preserved Stretching:** Interpolates positional embedding to support longer input lengths while minimizing disruption to well-trained positions.
- **Primary Component Matching:** Extracts coarse-grained image features from the fine-grained ones and aligns them with short captions to maintain short-text capability (a rough sketch follows this list).
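The matching objective can be sketched as two contrastive losses sharing one image encoder output: the fine-grained image feature is matched against the long caption, while a coarse-grained feature, obtained by keeping only the leading components of the image features, is matched against the short caption. The snippet below is a rough illustration; it uses a batch-level PCA projection (`torch.pca_lowrank`) as a stand-in for the paper's component-extraction step, and the function names, the number of retained components `k`, and the temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img: torch.Tensor, txt: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss between L2-normalized image and text features."""
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def coarse_grained(img_feat: torch.Tensor, k: int = 32) -> torch.Tensor:
    """Project batch image features onto their k leading principal directions
    and reconstruct, discarding the fine-grained residual."""
    mean = img_feat.mean(dim=0, keepdim=True)
    centered = img_feat - mean
    _, _, V = torch.pca_lowrank(centered, q=k)   # V: [d, k]
    return centered @ V @ V.t() + mean

def long_clip_loss(fine_img, long_txt, short_txt, k: int = 32) -> torch.Tensor:
    """Fine-grained feature <-> long caption, coarse-grained <-> short caption."""
    coarse_img = coarse_grained(fine_img, k)
    return (contrastive_loss(fine_img, long_txt) +
            contrastive_loss(coarse_img, short_txt))
```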
**Experiments:**
- **Evaluation Datasets:** ImageNet-1K, COCO2017, Flickr30k, ShareGPT4V.
- **Evaluation Settings:** Class-name prompt templates for zero-shot classification; input-token truncation at 77 tokens for the vanilla CLIP baseline on long captions.
- **Training Settings:** ShareGPT4V dataset with 1M (long caption, image) pairs, 1 epoch, batch size 2048.
**Results:**
- **Long Caption Text-Image Retrieval:** Improves recall by roughly 25% on long-caption retrieval and by 6% on traditional short-caption retrieval.
- **Zero-Shot Image Classification:** No significant performance degradation.