RWKV-CLIP: A Robust Vision-Language Representation Learner


11 Jun 2024 | Tiancheng Gu, Kaicheng Yang, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai, Jiankang Deng
RWKV-CLIP is a robust vision-language representation learner that improves upon the Contrastive Language-Image Pre-training (CLIP) framework by addressing data quality issues and enhancing the model architecture. The paper introduces a diverse description generation framework that leverages Large Language Models (LLMs) to synthesize and refine information from web-based texts, synthetic captions, and detection tags. This framework produces more accurate and semantically enriched descriptions, which are then used to train RWKV-CLIP.

RWKV-CLIP combines the effective parallel training of transformers with the efficient inference of RNNs, making it a powerful and efficient model for vision-language tasks. Comprehensive experiments across various model scales and pre-training datasets demonstrate that RWKV-CLIP achieves state-of-the-art performance on several downstream tasks, including linear probe, zero-shot classification, and zero-shot image-text retrieval. The paper also conducts extensive ablation studies on the effects of model and data scaling, as well as the impact of different text types and model architectures. The results show that RWKV-CLIP outperforms existing models in robustness, accuracy, and efficiency, making it a promising approach for vision-language representation learning. The code and pre-trained models are released for further research.
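Like CLIP, RWKV-CLIP is trained as a dual encoder: an image encoder and a text encoder are optimized so that matched image-text pairs score higher than mismatched ones in a batch. The sketch below illustrates that symmetric contrastive (InfoNCE) objective in PyTorch; it is a minimal illustration, not the paper's implementation, and the encoder outputs and the temperature value are hypothetical placeholders.

```python
# Minimal sketch of the symmetric image-text contrastive objective used in
# CLIP-style training. The RWKV encoder internals are omitted; the embeddings
# below are stand-ins for the image/text encoder outputs.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalize both modalities so the dot product is cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = image_features @ text_features.t() / temperature

    # The i-th image matches the i-th text, so the targets are the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average of image-to-text and text-to-image cross-entropy.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

if __name__ == "__main__":
    # Random placeholder embeddings (batch of 8, dimension 512).
    img = torch.randn(8, 512)   # stand-in for image-encoder outputs
    txt = torch.randn(8, 512)   # stand-in for text-encoder outputs
    print(clip_contrastive_loss(img, txt).item())
```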