Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations

12 Mar 2024 | Chenyu You, Yifei Min, Weicheng Dai, Jasjeet S. Sekhon, Lawrence Staib, James S. Duncan
This paper presents a method for calibrating multi-modal representations in pre-trained vision-language models (VLMs) such as CLIP to improve group robustness without requiring group annotations. The authors first investigate spurious correlations in CLIP and CLIP+ERM, showing that these correlations can produce biased predictions for underrepresented groups.

To address this, they propose Contrastive Feature Recalibration (CFR), a lightweight representation calibration method. CFR builds a calibration set from the pre-trained model's own outputs and then uses contrastive learning to refine the representations of the samples in that set: each sample's feature is pulled toward the centroid of its designated class and pushed away from the centroids of opposing classes. Because the procedure operates on model-derived class structure rather than group labels, it reduces reliance on spurious features while preserving generalization.

The method is validated through extensive experiments on multiple benchmarks, where CFR outperforms both semi-supervised and supervised baselines in worst-group accuracy (WGA) and generalization. CFR is parameter-efficient and requires no group annotations, making it practical for real-world deployment. The paper also discusses the role of feature recalibration in robustness and highlights the effectiveness of incorporating language attributes into vision classifiers to further enhance group robustness. Overall, the study offers a novel, practical approach to mitigating spurious correlations in pre-trained VLMs and improving fairness and robustness in real-world applications.
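The centroid-pull/centroid-push objective described above can be sketched as a simple centroid-based contrastive loss. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the function name `recalibration_loss`, the temperature value, and the use of class-mean centroids over a single batch are all illustrative choices.

```python
import numpy as np

def recalibration_loss(features, labels, temperature=0.1):
    """Illustrative centroid-based contrastive loss: pull each sample's
    feature toward its own class centroid, push it away from the others.

    features: (N, D) array of sample representations.
    labels:   (N,) array of class labels (no group labels needed).
    """
    # L2-normalize features so dot products are cosine similarities.
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)

    # Class centroids (mean of normalized features per class), re-normalized.
    classes = np.unique(labels)  # sorted unique class labels
    centroids = np.stack([feats[labels == c].mean(axis=0) for c in classes])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

    # Similarity of every sample to every class centroid, scaled by temperature.
    logits = feats @ centroids.T / temperature        # shape (N, C)

    # Cross-entropy against each sample's own class centroid: minimizing this
    # increases similarity to the designated centroid and decreases it to the rest.
    own = np.searchsorted(classes, labels)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), own].mean()
```

In the paper's setting the features would come from a frozen CLIP encoder and the loss would drive a lightweight calibration module; here the loss is computed directly on raw feature arrays to keep the sketch self-contained.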