RWKV-CLIP: A Robust Vision-Language Representation Learner


11 Jun 2024 | Tiancheng Gu, Kaicheng Yang, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai, Jiankang Deng
This paper examines Contrastive Language-Image Pre-training (CLIP) from the perspectives of both data and model architecture to improve its performance on a range of vision-language tasks. To address the prevalence of noisy web data, the authors introduce a diverse description generation framework that leverages Large Language Models (LLMs) to synthesize and refine content from web-based texts, synthetic captions, and detection tags, yielding more accurate and semantically enriched descriptions. They also propose RWKV-CLIP, the first RWKV-driven vision-language representation learning model, which combines the effective parallel training of transformers with the efficient inference of RNNs. Extensive experiments across model scales and pre-training datasets show that RWKV-CLIP achieves state-of-the-art performance on several downstream tasks, including linear probe, zero-shot classification, and zero-shot image-text retrieval.

Key contributions:

1. A diverse description generation framework that enhances the quality of web-based image-text pairs.
2. RWKV-CLIP, the first RWKV-driven vision-language representation learning model.
3. Robust and efficient performance across various model scales and pre-training datasets.

The paper also provides detailed experimental settings, results, and ablation studies that validate the effectiveness and robustness of RWKV-CLIP. The code and pre-trained models are released to facilitate future research.
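To make the training objective concrete, below is a minimal sketch of the symmetric contrastive (InfoNCE) loss typically used in CLIP-style pre-training, written in PyTorch. The function name, embedding dimension, batch size, and temperature value are illustrative assumptions, not the authors' implementation; in RWKV-CLIP the image and text embeddings would come from RWKV-based encoders, whose details are in the released code.

```python
# Minimal sketch of a CLIP-style symmetric contrastive loss (assumed setup).
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalize so dot products become cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity matrix: logits[i, j] = sim(image_i, text_j) / T.
    logits = image_features @ text_features.t() / temperature

    # Matching image-text pairs sit on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2


# Example usage with random tensors standing in for encoder outputs.
if __name__ == "__main__":
    img = torch.randn(8, 512)   # hypothetical batch of image embeddings
    txt = torch.randn(8, 512)   # matching batch of text embeddings
    print(clip_contrastive_loss(img, txt).item())
```

The same normalized embeddings also support the zero-shot evaluation described above: classification and retrieval reduce to ranking images and texts by cosine similarity.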