EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

6 Feb 2024 | Quan Sun, Jinsheng Wang, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Xinlong Wang
EVA-CLIP-18B is the largest and most powerful open-source CLIP model to date, with 18 billion parameters. Trained on a merged dataset of 2 billion image-text pairs drawn from LAION-2B and COYO-700M, and with only 6 billion training samples seen, it achieves 80.7% zero-shot top-1 accuracy averaged over 27 image classification benchmarks, outperforming its predecessor EVA-CLIP (5 billion parameters) and other open-source CLIP models. Although the training dataset is held constant, performance improves consistently as the model is scaled up, demonstrating the potential of EVA-style weak-to-strong visual model scaling: a large EVA-18B vision encoder is first pre-trained with a smaller EVA-CLIP model as its teacher, and the resulting encoder is then used to initialize the contrastive image-text pre-training of EVA-CLIP-18B. Training uses the LAMB optimizer, a cosine learning-rate schedule, and several additional techniques to keep large-scale training efficient.

EVA-CLIP-18B is evaluated on 33 datasets spanning zero-shot image classification, video classification, and image-text retrieval, where it outperforms other open-source CLIP models across these tasks. It also performs well on 3D representation learning and on robustness tests, remaining stable under different image transformations, with gains that grow consistently with scale. The training code and model weights are publicly released to facilitate further research in vision and multimodal foundation models.
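To make the zero-shot evaluation protocol concrete, the sketch below shows CLIP-style zero-shot classification with the open_clip library: class names are turned into text prompts, images and prompts are embedded, and the class with the highest cosine similarity is predicted. It uses a small public checkpoint as a stand-in; the released EVA-CLIP-18B weights would plug into the same pattern, but the exact model/checkpoint identifiers and the image path here are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of CLIP-style zero-shot classification (the protocol behind the
# reported 80.7% average top-1). Checkpoint and file names are placeholders.
import torch
import open_clip
from PIL import Image

# Small public checkpoint as a stand-in; swap in the official EVA-CLIP-18B
# release for the actual model (identifiers depend on that release).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Class names are wrapped in a prompt template and encoded once.
class_names = ["dog", "cat", "car"]
prompts = tokenizer([f"a photo of a {c}" for c in class_names])

# "example.jpg" is a placeholder input image.
image = preprocess(Image.open("example.jpg")).unsqueeze(0)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(prompts)
    # Cosine similarity between L2-normalized embeddings, softmaxed over classes.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

pred = class_names[probs.argmax(dim=-1).item()]
print(f"predicted class: {pred}, probabilities: {probs.tolist()}")
```

The same loop, repeated over each benchmark's test set and label list, yields the per-dataset top-1 accuracies that are averaged across the 27 image classification benchmarks.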