Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations

12 Mar 2024 | Chenyu You†, Yifei Min†, Weicheng Dai†, Jasjeet S. Sekhon, Lawrence Staib, James S. Duncan
This paper addresses the challenges of fine-tuning pre-trained vision-language models, such as CLIP, which often rely on spurious features and lack group robustness. The authors propose a method called Contrastive Feature Recalibration (CFR) to mitigate these issues without requiring group annotations. CFR involves two main steps: (1) forming a calibration set using the pre-trained CLIP and (2) calibrating the representations of samples within this set through contrastive learning. The method is evaluated on multiple benchmarks, demonstrating significant improvements in group robustness compared to existing methods. The authors also provide visualizations and ablation studies to support their findings, showing that CFR effectively reduces reliance on spurious correlations and enhances model generalization.
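The calibration step described above relies on a contrastive objective that pulls an image representation toward the embedding associated with its class and away from the others. As a rough illustration only (not the paper's actual loss), the sketch below implements a generic InfoNCE-style objective over L2-normalized image features and per-class text embeddings; the function name, temperature value, and all variable names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def contrastive_calibration_loss(img_feats, txt_feats, labels, tau=0.07):
    """Illustrative InfoNCE-style loss (not the paper's exact objective).

    img_feats: (N, D) image features from a sample in the calibration set.
    txt_feats: (C, D) one text embedding per class.
    labels:    (N,) integer class index of each sample.
    Each image feature is encouraged to be similar to its class's text
    embedding and dissimilar to the other classes' embeddings.
    """
    # L2-normalize so dot products are cosine similarities, as in CLIP.
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    logits = img @ txt.T / tau                    # (N, C) scaled similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the correct class embedding per sample.
    return -log_probs[np.arange(len(labels)), labels].mean()
```

Minimizing such a loss over the calibration set adjusts representations so that class-relevant (rather than spurious) directions dominate the similarity structure, which is the intuition the abstract gestures at.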