A Closer Look at the Robustness of Contrastive Language-Image Pre-Training (CLIP)


2024-02-12 | Weijie Tu, Weijian Deng, Tom Gedeon
This paper investigates the robustness of Contrastive Language-Image Pre-training (CLIP) models along three axes: robustness to variation in visual factors, out-of-distribution (OOD) detection, and predictive uncertainty. The study evaluates 83 CLIP models and 127 ImageNet classifiers across 10 visual factors, 5 types of OOD data, and 8 test conditions.

Key findings:
- CLIP models are more robust than ImageNet classifiers on six of the ten visual factors, but less robust on factors such as object pose.
- CLIP models show a shape bias, which diminishes after fine-tuning on ImageNet: fine-tuning introduces a texture bias that weakens the shape bias, and models trained at larger input resolutions show a stronger texture bias.
- The training distribution strongly affects robustness to visual factors: CLIP models trained on LAION are more robust to shape-related factors than those trained on WIT, and models trained on different subsets of LAION follow similar trends.
- Fine-tuned CLIP models perform slightly better on certain factors than models pre-trained on more data.
- CLIP models trained on WIT outperform those trained on LAION on both OOD detection and calibration, and fine-tuning on ImageNet-12K further improves OOD detection.
- CLIP models are not consistently better calibrated than other ImageNet models, contradicting previous findings; temperature scaling and in-distribution (ID) calibration both improve their calibration.
- Test-time prompts affect CLIP's performance on all three safety objectives.

Overall, the study provides insights into the robustness, OOD detection, and calibration of CLIP models, and highlights the importance of careful training-source design for improving their performance.
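To make the calibration finding concrete, here is a minimal sketch of post-hoc temperature scaling: logits are divided by a scalar temperature T (fit on held-in validation data) before the softmax, which softens overconfident predictions without changing the predicted class. The function names and the grid-search fitting below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def temperature_scale(logits, T):
    """Divide logits by a scalar temperature T before the softmax.
    T > 1 softens (less confident) probabilities; T < 1 sharpens them."""
    return softmax(logits / T)

def nll(probs, labels):
    """Average negative log-likelihood of the true labels."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 46)):
    """Pick the T that minimizes validation NLL over a simple grid (sketch;
    practical implementations typically optimize T with gradient descent)."""
    return min(grid, key=lambda T: nll(temperature_scale(logits, T), labels))
```

Because dividing by a positive scalar preserves the ordering of the logits, temperature scaling never changes which class is predicted; it only recalibrates the confidence attached to that prediction.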
The study also suggests that further research is needed to explore the impact of different training sources and fine-tuning procedures on CLIP models.