This paper addresses the issue of overfitting in vision-language models (VLMs) during fine-tuning for out-of-distribution (OOD) generalization. Existing VLMs, such as CLIP, excel at zero-shot recognition, but fine-tuning them on a closed set of classes degrades their OOD generalization. Recent methods like prompt learning and adapter tuning have improved both in-distribution (ID) and OOD accuracy, but overfitting remains a challenge. The paper proposes OGEN, a novel approach that improves OOD generalization by synthesizing OOD features from class names and introducing an adaptive self-distillation mechanism to regularize the model.
The key contributions of OGEN include: 1) a class-conditional feature generator that synthesizes OOD features for effective regularization, and 2) an adaptive self-distillation method that reduces overfitting during joint optimization. The feature generator leverages CLIP's aligned image-text feature spaces to synthesize image features for unknown classes, enabling the model to learn a more reliable decision boundary between known and unknown classes. The adaptive self-distillation mechanism further enhances regularization by transferring knowledge between model states, reducing overfitting.
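The two mechanisms can be illustrated with a minimal NumPy sketch. All names, dimensions, the linear map `W`, and the random stand-ins for CLIP embeddings below are illustrative assumptions, not OGEN's actual architecture: the paper's generator is a learned network, and this sketch only shows the data flow of synthesizing image-space features from class-name text embeddings and computing a distillation penalty against an earlier model state.

```python
import numpy as np

rng = np.random.default_rng(0)

DIM = 512        # assumed CLIP joint embedding dimension
N_UNKNOWN = 4    # number of unseen class names (illustrative)
N_KNOWN = 10     # number of known (fine-tuning) classes (illustrative)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Stand-ins for CLIP text embeddings of unknown and known class names.
text_feats = l2_normalize(rng.standard_normal((N_UNKNOWN, DIM)))
known_text_feats = l2_normalize(rng.standard_normal((N_KNOWN, DIM)))

# Class-conditional feature generator: a single linear map here, as a
# placeholder for OGEN's learned generator, which synthesizes image
# features conditioned on a class-name text embedding.
W = rng.standard_normal((DIM, DIM)) / np.sqrt(DIM)
synth_image_feats = l2_normalize(text_feats @ W)

# The synthesized "OOD" features are scored against known-class text
# embeddings (cosine similarities), giving logits that can regularize
# the decision boundary between known and unknown classes.
ood_logits = synth_image_feats @ known_text_feats.T

# Adaptive self-distillation (sketch): penalize divergence between the
# current model's predictions and those of an earlier "teacher" state,
# simulated here by perturbing the student logits.
teacher_logits = ood_logits + 0.05 * rng.standard_normal(ood_logits.shape)
TAU = 0.07  # CLIP-style temperature (assumed)
p_teacher = softmax(teacher_logits / TAU)
p_student = softmax(ood_logits / TAU)
kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1).mean()
```

In a real training loop, the KL term would be added to the fine-tuning loss so that knowledge is transferred between model states, which is the regularization role the paper attributes to adaptive self-distillation.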
Experiments show that OGEN consistently improves OOD generalization performance across various settings, including within-dataset and cross-dataset generalization. The method achieves significant gains in OOD accuracy, with up to an 18.77% improvement in some cases. OGEN is applicable to different fine-tuning methods and demonstrates superior generalization capabilities. The approach is model-agnostic and can be applied to other vision-language models. The results validate the effectiveness of OGEN in reducing overfitting and improving OOD generalization for VLMs.