2024-01-15 | Yi Zhang, Ce Zhang, Ke Yu, Yushun Tang, Zhihai He
This paper proposes Concept-Guided Prompt Learning (CPL) to enhance the generalization capability of vision-language models (VLMs). The core idea is to incorporate visual concepts into the prompt learning process so that the model generalizes better across domains and tasks. The authors leverage CLIP's pretrained knowledge to build a visual concept cache, which enables concept-guided prompting, and they develop a projector that transforms multi-level visual features into text features, improving consistency between the visual and linguistic modalities.
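To make these two mechanisms concrete, here is a minimal PyTorch sketch of what a concept cache and a multi-level projector could look like. It is an illustration under stated assumptions, not the authors' implementation: the names (ConceptCache, FeatureProjector), the top-k retrieval with similarity weighting, and the concatenate-then-project design are all hypothetical choices; only the overall roles (retrieve cached CLIP concept features, map multi-level visual features into text space) come from the paper's description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptCache(nn.Module):
    """Hypothetical key cache of visual concept features precomputed with a
    frozen CLIP image encoder; queries retrieve the most similar concepts."""
    def __init__(self, concept_features: torch.Tensor):
        super().__init__()
        # concept_features: (num_concepts, dim), precomputed and frozen
        self.register_buffer("keys", F.normalize(concept_features, dim=-1))

    def forward(self, image_features: torch.Tensor, top_k: int = 3) -> torch.Tensor:
        # Cosine similarity between query images and cached concepts
        sims = F.normalize(image_features, dim=-1) @ self.keys.t()
        scores, idx = sims.topk(top_k, dim=-1)          # (batch, top_k)
        retrieved = self.keys[idx]                      # (batch, top_k, dim)
        # Similarity-weighted blend of the retrieved concept features,
        # which could then condition the learnable prompt tokens
        weights = scores.softmax(dim=-1).unsqueeze(-1)  # (batch, top_k, 1)
        return (weights * retrieved).sum(dim=1)         # (batch, dim)

class FeatureProjector(nn.Module):
    """Hypothetical projector mapping multi-level visual features into the
    text-feature space, so visual evidence aligns with the language side."""
    def __init__(self, visual_dim: int, text_dim: int, num_levels: int):
        super().__init__()
        self.proj = nn.Linear(visual_dim * num_levels, text_dim)

    def forward(self, multi_level_feats: list[torch.Tensor]) -> torch.Tensor:
        # multi_level_feats: one (batch, visual_dim) tensor per encoder level
        return self.proj(torch.cat(multi_level_feats, dim=-1))
```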
CPL outperforms existing state-of-the-art methods on several tasks, including base-to-novel generalization, cross-dataset transfer, and domain generalization. The method is evaluated on eleven datasets: ImageNet, Caltech101, OxfordPets, StanfordCars, Flowers102, Food101, FGVCAircraft, SUN397, UCF101, DTD, and EuroSAT. Across these benchmarks, CPL achieves significantly higher accuracy than prior methods such as CoOp, CoCoOp, ProGrad, MaPLe, and KgCoOp.
For domain generalization, CPL is evaluated on the ImageNet variants ImageNet-A, ImageNet-R, ImageNet-V2, and ImageNet-Sketch. The results demonstrate that CPL is robust to distribution shifts, outperforming the other methods in classification accuracy.
Ablation studies analyze the contribution of each component: concept-guided prompting brings the largest improvement, followed by the projector and the task adapter. The method is also efficient, requiring only about 50 minutes of training time where some competing methods need more than 14 hours.
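The paper does not spell out the task adapter's internals in this summary, but a plausible reading, following common adapter-style tuning of CLIP features (e.g., CLIP-Adapter), is a small residual bottleneck over frozen features. The sketch below is an assumption in that spirit; the class name, bottleneck shape, and blend ratio are all illustrative.

```python
import torch
import torch.nn as nn

class TaskAdapter(nn.Module):
    """Hypothetical lightweight adapter: a residual bottleneck that lightly
    specializes frozen CLIP features to the downstream task."""
    def __init__(self, dim: int, reduction: int = 4, ratio: float = 0.2):
        super().__init__()
        self.ratio = ratio  # how much adapted signal to mix in
        self.bottleneck = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Residual blend keeps most of the frozen CLIP feature intact,
        # which is one reason adapter-style tuning trains quickly
        return self.ratio * self.bottleneck(feats) + (1 - self.ratio) * feats
```

A design like this touches very few parameters relative to the backbone, which is consistent with the large training-time gap the authors report.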
In conclusion, CPL significantly improves the generalization capability of VLMs by incorporating visual concepts into prompt learning. It is effective across a range of tasks and surpasses existing state-of-the-art methods in both accuracy and training efficiency.