This paper investigates the robustness of Contrastive Language-Image Pre-training (CLIP) models across three key safety objectives: resilience to visual factor variations, calibrated uncertainty estimation, and the ability to detect anomalous inputs. The study covers 83 CLIP models and 127 ImageNet classifiers, examining 10 visual factors, 5 types of out-of-distribution (OOD) data, and 8 natural and challenging test conditions. Key findings include:
1. **Visual Factor-Level Robustness**: CLIP models generally exhibit better factor-level robustness than other ImageNet models on six of the ten visual factors studied, but are less robust on factors such as object pose.
2. **Calibrated Uncertainty Estimations**: Contrary to prior findings, CLIP models are not consistently better calibrated than other ImageNet models; their calibration depends on the distribution and quantity of the training data (see the calibration sketch after this list).
3. **Out-of-Distribution Detection**: Among CLIP models trained on the same source, in-distribution (ID) accuracy correlates strongly with OOD detection performance, and the choice of training source and fine-tuning procedure significantly influences OOD detection (see the OOD-detection sketch at the end of this summary).
4. **Predictive Uncertainty**: CLIP models maintain reasonable uncertainty estimates under distribution shifts after ID calibration with temperature scaling.
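To make the calibration findings in items 2 and 4 concrete, the sketch below shows the two standard ingredients they rely on: expected calibration error (ECE) as the metric, and a single temperature fitted on ID validation logits as the post-hoc correction. This is a minimal illustration with synthetic logits standing in for CLIP zero-shot outputs; the array shapes, constants, and helper names are assumptions for illustration, not the paper's code.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(logits, T=1.0):
    # Temperature-scaled softmax with the usual max-subtraction for numerical stability.
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def expected_calibration_error(probs, labels, n_bins=15):
    # Standard ECE: bin samples by confidence, average |accuracy - confidence| per bin.
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

def fit_temperature(logits, labels):
    # Fit one scalar temperature on ID validation logits by minimizing negative log-likelihood.
    def nll(T):
        p = softmax(logits, T)
        return -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
    return minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded").x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic, deliberately overconfident logits standing in for CLIP zero-shot outputs.
    labels = rng.integers(0, 10, size=2000)
    logits = rng.normal(size=(2000, 10))
    logits[np.arange(2000), labels] += 2.0
    logits *= 3.0
    T = fit_temperature(logits, labels)
    print(f"fitted T = {T:.2f}")
    print(f"ECE before: {expected_calibration_error(softmax(logits), labels):.4f}")
    print(f"ECE after : {expected_calibration_error(softmax(logits, T), labels):.4f}")
```

Note that temperature scaling only rescales confidences; it does not change the predicted class, so accuracy is unaffected while ECE typically drops.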
The study highlights the importance of training source design and suggests that CLIP models can be made more robust and reliable in real-world applications by carefully selecting training sources and fine-tuning procedures.
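For item 3, a common way to score a zero-shot CLIP classifier for OOD detection is the maximum softmax probability (MSP) over its cosine-similarity logits, with AUROC measuring how well that score separates ID from OOD inputs. The sketch below illustrates that recipe with synthetic embeddings in place of real CLIP image and text features; the dimensions, temperature, and data are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def msp_score(image_feats, text_feats, temperature=0.01):
    # CLIP-style logits: cosine similarity between image and class-text embeddings,
    # divided by a temperature; the MSP score is the max softmax probability per image.
    logits = l2_normalize(image_feats) @ l2_normalize(text_feats).T / temperature
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return probs.max(axis=1)

def auroc(id_scores, ood_scores):
    # Rank-based AUROC: probability that a random ID sample outscores a random OOD sample.
    scores = np.concatenate([id_scores, ood_scores])
    ranks = np.empty(len(scores))
    ranks[scores.argsort()] = np.arange(1, len(scores) + 1)
    n_id, n_ood = len(id_scores), len(ood_scores)
    return (ranks[:n_id].sum() - n_id * (n_id + 1) / 2) / (n_id * n_ood)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, n_classes = 512, 10
    text_feats = rng.normal(size=(n_classes, dim))  # stand-in for class prompt embeddings
    # ID images cluster near their class prompts; OOD images are unrelated noise.
    id_feats = text_feats[rng.integers(0, n_classes, 500)] + rng.normal(scale=0.5, size=(500, dim))
    ood_feats = rng.normal(size=(500, dim))
    print(f"AUROC = {auroc(msp_score(id_feats, text_feats), msp_score(ood_feats, text_feats)):.3f}")
```

Because the score is computed purely from the zero-shot logits, no OOD data is needed at training time; different training sources or fine-tuning procedures change the embedding geometry and therefore the separation such a score can achieve.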