6 Oct 2022 | Kaiyang Zhou, Jingkang Yang, Chen Change Loy, Ziwei Liu
The paper "Conditional Prompt Learning for Vision-Language Models" by Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu addresses the challenge of adapting powerful pre-trained vision-language models, such as CLIP, to downstream datasets. The authors introduce a method called Context Optimization (CoOp), which leverages prompt learning from natural language processing to adapt these models. CoOp turns context words in prompts into learnable vectors, improving performance with only a few labeled images. However, the authors identify a critical issue with CoOp: the learned context is not generalizable to unseen classes within the same dataset, suggesting overfitting to base classes.
To address this problem, the authors propose Conditional Context Optimization (CoCoOp), which extends CoOp by learning a lightweight neural network to generate an input-conditional token (vector) for each image. Because this dynamic prompt adapts to each instance rather than to a fixed set of classes, it is less sensitive to class shift. Extensive experiments show that CoCoOp generalizes better than CoOp to unseen classes and exhibits stronger domain generalization performance. The paper also reviews related work on vision-language models, prompt learning, and zero-shot learning, and provides a detailed methodology and experimental setup. The results demonstrate the effectiveness of CoCoOp across three problem settings: generalization from base to new classes, cross-dataset transfer, and domain generalization.
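To make the instance-conditioning mechanism concrete, here is a minimal sketch in PyTorch, again an illustration rather than the official code. The two-layer bottleneck (Linear-ReLU-Linear) follows the paper's description of the Meta-Net; the dimensions and the `class_embeds` placeholder are assumptions carried over from the sketch above.

```python
import torch
import torch.nn as nn

class MetaNet(nn.Module):
    """Lightweight bottleneck MLP, per the paper's Meta-Net description."""

    def __init__(self, feat_dim=512, ctx_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // 16),  # 16x bottleneck
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim // 16, ctx_dim),
        )

    def forward(self, image_features):
        # image_features: (batch, feat_dim) from the frozen CLIP image encoder.
        return self.net(image_features)  # (batch, ctx_dim)

class ConditionalPromptLearner(nn.Module):
    """Sketch of CoCoOp-style input-conditional prompts (illustrative only)."""

    def __init__(self, class_embeds, n_ctx=4, ctx_dim=512, feat_dim=512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)
        self.meta_net = MetaNet(feat_dim, ctx_dim)
        # Frozen class-name embeddings, shape (n_classes, n_name_tokens, ctx_dim).
        self.register_buffer("class_embeds", class_embeds)

    def forward(self, image_features):
        # One bias vector per image, broadcast over all context tokens:
        # static context (n_ctx, ctx_dim) + bias (batch, 1, ctx_dim).
        bias = self.meta_net(image_features).unsqueeze(1)
        ctx = self.ctx.unsqueeze(0) + bias  # (batch, n_ctx, ctx_dim)
        # Prepend the conditioned context to every class-name embedding,
        # yielding (batch, n_classes, n_ctx + n_name_tokens, ctx_dim).
        n_cls = self.class_embeds.shape[0]
        ctx = ctx.unsqueeze(1).expand(-1, n_cls, -1, -1)
        cls = self.class_embeds.unsqueeze(0).expand(ctx.shape[0], -1, -1, -1)
        return torch.cat([ctx, cls], dim=2)
```

Sharing a single image-conditioned bias across all context tokens, rather than generating a full prompt per image, keeps the Meta-Net small, in line with the paper's emphasis on a lightweight design.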