22 Jul 2024 | Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Jiaqi Wang
**Long-CLIP: Unlocking the Long-Text Capability of CLIP**
**Authors:** Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Jiaqi Wang
**Institutions:** Shanghai AI Laboratory, Shanghai Jiao Tong University, The Chinese University of Hong Kong
**GitHub:** https://github.com/beichenzbc/Long-CLIP
**Abstract:**
CLIP, a cornerstone for zero-shot classification, text-image retrieval, and text-to-image generation, is limited by its 77-token text input length. This paper introduces Long-CLIP, a plug-and-play alternative that supports long-text input while retaining or surpassing CLIP's zero-shot generalizability. Long-CLIP addresses the limitations of CLIP through two novel strategies: knowledge-preserved stretching of positional embedding and primary component matching of CLIP features. With just 1 million extra long text-image pairs, Long-CLIP outperforms CLIP by 20% in long caption text-image retrieval and 6% in traditional text-image retrieval tasks. It also enhances image generation from detailed text descriptions.
**Keywords:** Multimodality, Zero-shot Image Classification, Text-Image Retrieval, Text-to-Image Generation
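Since Long-CLIP is positioned as a plug-and-play replacement for CLIP, a typical workflow mirrors the original `clip` package. The snippet below is a sketch assuming the repository exposes a CLIP-like interface (`longclip.load`, `longclip.tokenize`, `encode_image`, `encode_text`) and a released checkpoint such as `longclip-B.pt`; consult the GitHub repository above for the exact entry points.

```python
import torch
from PIL import Image
from model import longclip  # import path assumed from the Long-CLIP repository

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = longclip.load("./checkpoints/longclip-B.pt", device=device)

# Long captions (well beyond CLIP's ~20-token effective length) are accepted directly.
captions = [
    "A man in a dark blue jacket is crossing a rain-soaked street at dusk, "
    "with a red car parked near a row of small shops and a dog waiting by the curb.",
    "A close-up photo of a cat sleeping on a wooden table next to a cup of coffee.",
]
text = longclip.tokenize(captions).to(device)
image = preprocess(Image.open("demo.png")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(probs)  # similarity of the image to each long caption
```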
**Introduction:**
CLIP's text encoder uses an absolute positional embedding limited to 77 tokens, restricting its ability to handle detailed descriptions. This paper proposes Long-CLIP to address this limitation by relaxing the input text length and fine-tuning CLIP with long text-image pairs. The key contributions include:
1. **Knowledge-Preserved Stretching:** Retains the well-trained positional embeddings for the first 20 tokens and interpolates the remaining positions with a larger ratio (a minimal sketch follows this list).
2. **Primary Component Matching:** Aligns fine-grained and coarse-grained image features with long and short captions, respectively.
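The stretching idea in the first contribution can be made concrete with a short sketch. The function below assumes CLIP's text positional embedding is a `[77, d]` tensor and uses linear interpolation; the kept boundary (20 positions) and stretching ratio (4x, giving 248 positions) follow the paper's description, but the function name and implementation details are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def knowledge_preserved_stretch(pos_emb: torch.Tensor,
                                keep: int = 20,
                                ratio: int = 4) -> torch.Tensor:
    """Stretch a [77, d] positional embedding to a longer sequence.

    The first `keep` positions (the well-trained ones) are copied unchanged;
    the remaining 77 - keep positions are linearly interpolated by `ratio`.
    With keep=20 and ratio=4 this yields 20 + 57 * 4 = 248 positions.
    """
    kept = pos_emb[:keep]                                  # [keep, d], untouched
    rest = pos_emb[keep:].t().unsqueeze(0)                 # [1, d, 77 - keep]
    stretched = F.interpolate(rest, scale_factor=ratio,
                              mode="linear", align_corners=True)
    stretched = stretched.squeeze(0).t()                   # [(77 - keep) * ratio, d]
    return torch.cat([kept, stretched], dim=0)             # [248, d] for the defaults

# Example: stretch a randomly initialized embedding of CLIP's shape.
old_pos_emb = torch.randn(77, 512)
new_pos_emb = knowledge_preserved_stretch(old_pos_emb)
print(new_pos_emb.shape)  # torch.Size([248, 512])
```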
**Methods:**
- **Exploring the Effective Length of CLIP:** Empirical studies show that CLIP's effective text length is only 20 tokens.
- **Knowledge-Preserved Stretching:** Interpolates positional embedding to support longer input lengths while minimizing disruption to well-trained positions.
- **Primary Component Matching:** Extracts coarse-grained image features from the fine-grained ones and aligns them with short captions to maintain short-text capability (a rough sketch follows this list).
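The matching objective can be sketched as two contrastive losses sharing one image encoder output: the fine-grained image feature is matched against the long caption, while a coarse-grained feature, obtained by keeping only the leading components of the image features, is matched against the short caption. The snippet below is a rough illustration; it uses a batch-level PCA projection (`torch.pca_lowrank`) as a stand-in for the paper's component-extraction step, and the function names, the number of retained components `k`, and the temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img: torch.Tensor, txt: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss between L2-normalized image and text features."""
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def coarse_grained(img_feat: torch.Tensor, k: int = 32) -> torch.Tensor:
    """Project batch image features onto their k leading principal directions
    and reconstruct, discarding the fine-grained residual."""
    mean = img_feat.mean(dim=0, keepdim=True)
    centered = img_feat - mean
    _, _, V = torch.pca_lowrank(centered, q=k)   # V: [d, k]
    return centered @ V @ V.t() + mean

def long_clip_loss(fine_img, long_txt, short_txt, k: int = 32) -> torch.Tensor:
    """Fine-grained feature <-> long caption, coarse-grained <-> short caption."""
    coarse_img = coarse_grained(fine_img, k)
    return (contrastive_loss(fine_img, long_txt) +
            contrastive_loss(coarse_img, short_txt))
```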
**Experiments:**
- **Evaluation Datasets:** ImageNet-1K, COCO2017, Flickr30k, ShareGPT4V.
- **Evaluation Settings:** Class-name prompt templates for zero-shot classification; input-token truncation at 77 tokens for the vanilla CLIP baseline on long captions.
- **Training Settings:** ShareGPT4V dataset with 1M (long caption, image) pairs, 1 epoch, batch size 2048.
**Results:**
- **Long Caption Text-Image Retrieval:** Improves recall by roughly 25% on long-caption retrieval and by 6% on traditional short-caption retrieval.
- **Zero-Shot Image Classification:** No significant performance degradation.