Concept-Guided Prompt Learning for Generalization in Vision-Language Models

15 Jan 2024 | Yi Zhang, Ce Zhang, Ke Yu, Yushun Tang, Zhihai He
The paper "Concept-Guided Prompt Learning for Generalization in Vision-Language Models" addresses the issue of generalization in vision-language models (VLMs) by proposing a novel method called Concept-Guided Prompt Learning (CPL). The authors identify that current fine-tuning methods for CLIP, such as CoOp and CoCoOp, often perform poorly on fine-grained datasets due to their focus on global features and neglect of low-level visual concepts like colors, shapes, and sizes. To address this, CPL leverages the well-learned knowledge of CLIP to create a visual concept cache, enabling concept-guided prompting. Additionally, a projector is developed to refine text features by transforming multi-level visual features into text features.

Extensive experiments demonstrate that CPL significantly improves generalization capabilities compared to state-of-the-art methods on various tasks, including base-to-novel generalization, cross-dataset transfer, and domain generalization. The method's effectiveness is validated through comprehensive empirical results, showing superior performance in enhancing the consistency between visual and linguistic modalities.
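To make the two components concrete, here is a minimal sketch of what a visual concept cache and a multi-level-to-text projector could look like. This is not the paper's implementation: the dimensions (`LEVEL_DIMS`, `TEXT_DIM`), the single linear projection, and the cosine-similarity cache lookup are all illustrative assumptions standing in for CLIP's actual encoders and CPL's trained modules.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, not taken from the paper: three visual feature
# levels from an image encoder, projected into a shared text feature space.
LEVEL_DIMS = [256, 512, 1024]
TEXT_DIM = 512


def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


class Projector:
    """Maps concatenated multi-level visual features into the text space.

    A single random linear map here; in practice this would be a learned
    module trained to align visual and text features.
    """

    def __init__(self, level_dims, text_dim):
        in_dim = sum(level_dims)
        self.W = rng.standard_normal((in_dim, text_dim)) / np.sqrt(in_dim)

    def __call__(self, level_feats):
        return l2_normalize(np.concatenate(level_feats) @ self.W)


class ConceptCache:
    """Key-value store of visual concept features, queried by cosine similarity."""

    def __init__(self):
        self.keys, self.values = [], []

    def add(self, feat, concept_label):
        self.keys.append(l2_normalize(feat))
        self.values.append(concept_label)

    def query(self, feat, topk=1):
        # Keys and the query are unit-normalized, so the dot product
        # is cosine similarity; return the top-k matching concepts.
        sims = np.stack(self.keys) @ l2_normalize(feat)
        order = np.argsort(-sims)[:topk]
        return [self.values[i] for i in order]


# Usage: cache two (random, stand-in) concept features, then project an
# image's multi-level features and retrieve the nearest cached concept.
cache = ConceptCache()
red_feat = rng.standard_normal(TEXT_DIM)
round_feat = rng.standard_normal(TEXT_DIM)
cache.add(red_feat, "red")
cache.add(round_feat, "round")

proj = Projector(LEVEL_DIMS, TEXT_DIM)
img_levels = [rng.standard_normal(d) for d in LEVEL_DIMS]
text_like = proj(img_levels)  # unit-norm vector in the text feature space

print(cache.query(red_feat))  # nearest cached concept for this query
```

The retrieved concept labels would then condition the prompt (concept-guided prompting), while the projector's output refines the text features so that visual and linguistic representations stay consistent.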