**RWKV-CLIP: A Robust Vision-Language Representation Learner**
This paper revisits Contrastive Language-Image Pre-training (CLIP) from two perspectives, data and model architecture, to improve its performance on a range of vision-language tasks. To address the noise prevalent in web-crawled image-text pairs, the authors introduce a diverse description generation framework that leverages Large Language Models (LLMs) to synthesize and refine content from web-based texts, synthetic captions, and detection tags, producing more accurate and semantically richer descriptions.
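As a rough illustration of how such a framework might merge the three text sources, the sketch below builds a single prompt from a web caption, a synthetic caption, and detection tags and hands it to a placeholder LLM call. The prompt wording and the `query_llm` helper are hypothetical and stand in for whatever instructions and model the paper actually uses.

```python
def build_description_prompt(web_text: str, synthetic_caption: str, tags: list[str]) -> str:
    """Combine the three text sources into one instruction for an LLM.

    The prompt wording is illustrative; the paper's actual instructions may differ.
    """
    return (
        "Synthesize one accurate, detailed image description from the sources below.\n"
        f"Web caption (possibly noisy): {web_text}\n"
        f"Synthetic caption: {synthetic_caption}\n"
        f"Detected objects: {', '.join(tags)}\n"
        "Keep only details supported by at least one source."
    )


def generate_description(web_text: str, synthetic_caption: str, tags: list[str], query_llm) -> str:
    # `query_llm` is a placeholder for whichever LLM API is used (hypothetical).
    return query_llm(build_description_prompt(web_text, synthetic_caption, tags))
```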
The authors then propose RWKV-CLIP, the first RWKV-driven vision-language representation learning model, which combines the efficient parallelizable training of transformers with the efficient inference of RNNs. Extensive experiments across multiple model scales and pre-training datasets show that RWKV-CLIP achieves state-of-the-art performance on several downstream tasks, including linear probing, zero-shot classification, and zero-shot image-text retrieval.
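Since RWKV-CLIP is a CLIP-style model, it is presumably trained with the standard symmetric image-text contrastive (InfoNCE) objective; the minimal PyTorch sketch below shows that objective over placeholder image and text embeddings, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric image-text contrastive (InfoNCE) loss used by CLIP-style models.

    image_emb, text_emb: (batch, dim) embeddings from the image and text towers;
    matching pairs share the same row index.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature            # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)                # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)            # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```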
Key contributions include:
1. A diverse description generation framework that enhances the quality of web-based image-text pairs.
2. RWKV-CLIP, the first RWKV-driven vision-language representation learning model.
3. Robust and efficient performance across various model scales and pre-training datasets.
The paper also includes detailed experimental settings, results, and ablation studies to validate the effectiveness and robustness of RWKV-CLIP. The code and pre-trained models are released to facilitate future research.
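For the zero-shot classification evaluation mentioned above, CLIP-style models are typically scored by comparing an image embedding against embeddings of prompted class names. The sketch below shows that standard protocol; the `image_encoder`, `text_encoder`, and `tokenizer` objects are placeholders for the released RWKV-CLIP components, and the prompt template is the common "a photo of a {}" rather than anything specified here.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image: torch.Tensor, class_names: list[str],
                       image_encoder, text_encoder, tokenizer) -> int:
    """Standard CLIP-style zero-shot classification via image-text cosine similarity."""
    prompts = [f"a photo of a {name}" for name in class_names]
    text_emb = F.normalize(text_encoder(tokenizer(prompts)), dim=-1)    # (num_classes, dim)
    image_emb = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)  # (1, dim)
    similarity = image_emb @ text_emb.t()                               # cosine similarities
    return int(similarity.argmax(dim=-1))                               # predicted class index
```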