EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

6 Feb 2024 | Quan Sun, Jinsheng Wang, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Xinlong Wang
The paper introduces EVA-CLIP-18B, the largest and most powerful open-source CLIP model to date, with 18 billion parameters. The model achieves 80.7% average zero-shot top-1 accuracy across 27 widely recognized image classification benchmarks, outperforming its 5-billion-parameter predecessor EVA-CLIP and other open-source CLIP models. It is trained on a fixed dataset of 2 billion image-text pairs drawn from LAION-2B and COYO-700M, and performance improves consistently as model size scales. The paper details the scaling procedure, which follows the weak-to-strong paradigm of EVA: a large EVA vision model is first obtained by distilling knowledge from a smaller EVA-CLIP model, then used to initialize the larger CLIP training run. Training settings, evaluation metrics, and ablation studies are also discussed, highlighting the robustness and effectiveness of the model. The paper concludes by emphasizing the potential of EVA-style visual model scaling and notes that the model weights are released to facilitate future research in vision and multimodal foundation models.
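As background on how the reported zero-shot top-1 accuracy is obtained, the sketch below shows CLIP-style zero-shot classification: label prompts are encoded as text, the image is encoded, and the label with the highest image-text similarity is taken as the prediction. The checkpoint id, image path, and prompts are illustrative assumptions; EVA-CLIP-18B itself is distributed separately and is not loaded here.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Stand-in CLIP checkpoint for illustration only; EVA-CLIP-18B follows the same
# contrastive image-text matching, but its exact loading path is not shown here.
model_id = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

# Class names are turned into natural-language prompts (a common CLIP recipe).
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
image = Image.open("example.jpg")  # hypothetical input image

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarities scaled by the learned temperature;
# the argmax over labels is the zero-shot top-1 prediction.
probs = outputs.logits_per_image.softmax(dim=-1)
print(labels[probs.argmax().item()])
```

Averaging this top-1 prediction accuracy over the test sets of many classification benchmarks yields the kind of aggregate number (80.7% over 27 benchmarks) the paper reports.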