Learning to Prompt for Vision-Language Models
Kaiyang Zhou, Jingkang Yang, Chen Change Loy, Ziwei Liu
Abstract: Large pre-trained vision-language models like CLIP have shown great potential in learning representations that are transferable across a wide range of downstream tasks. Unlike traditional representation learning based on discretized labels, vision-language pre-training aligns images and texts in a common feature space, enabling zero-shot transfer to downstream tasks via prompting. This work shows that prompt engineering is a major challenge for deploying such models in practice, as it requires domain expertise and time-consuming word tuning. Inspired by recent advances in natural language processing (NLP) prompt learning, we propose Context Optimization (CoOp), a simple approach for adapting CLIP-like models to downstream image recognition. CoOp models a prompt's context words with learnable vectors while keeping the pre-trained parameters fixed. Two versions are provided: unified context and class-specific context. Extensive experiments on 11 datasets show that CoOp requires as few as one or two shots to beat hand-crafted prompts by a decent margin, and achieves significant improvements with more shots. Despite being a learning-based approach, CoOp achieves superb domain generalization compared with zero-shot models using hand-crafted prompts.
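To make the zero-shot transfer described above concrete, the sketch below synthesizes classification weights from a hand-crafted prompt template and compares them with an image embedding via cosine similarity. It assumes the open-source OpenAI CLIP package; the image file and class list are hypothetical placeholders.

```python
# A minimal sketch of zero-shot classification via prompting, assuming the
# open-source OpenAI CLIP package (https://github.com/openai/CLIP).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Classification weights are synthesized from a hand-crafted prompt template.
class_names = ["airplane", "dog", "pizza"]  # hypothetical downstream classes
prompts = clip.tokenize([f"a photo of a {c}." for c in class_names]).to(device)

# Hypothetical input image.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(prompts)
    # Cosine similarity between image and text embeddings gives the logits.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_features @ text_features.t()
    probs = logits.softmax(dim=-1)
```

The resulting probabilities rank the class names for the input image; CoOp replaces the hand-written template "a photo of a {}." with learned context vectors.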
Introduction: A common approach for building state-of-the-art visual recognition systems is to train vision models to predict a fixed set of object categories using discrete labels. However, this approach limits visual recognition systems to closed-set visual concepts. Recently, vision-language pre-training such as CLIP and ALIGN has emerged as a promising alternative for visual representation learning. The main idea is to align images and raw texts using two separate encoders. For any new classification task, classification weights can be synthesized by feeding sentences that describe the task-relevant categories to the text encoder. For pre-trained vision-language models, this text input, known as the prompt, plays a key role in downstream performance. However, identifying the right prompt is a non-trivial task, which often takes a significant amount of time for word tuning, as a slight change in wording can make a huge difference in performance. Inspired by recent advances in prompt learning research in NLP, we propose Context Optimization (CoOp), a simple approach for automating prompt engineering specifically for pre-trained vision-language models. CoOp models a prompt's context words with learnable vectors, and two implementations are provided: unified context and class-specific context. During training, we minimize prediction errors using the cross-entropy loss with respect to the learnable context vectors while keeping the entire set of pre-trained parameters fixed. The gradients can be back-propagated all the way through the text encoder, distilling the rich knowledge encoded in its parameters for learning task-relevant context.
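The following is a minimal, self-contained PyTorch sketch of the unified-context idea. Small random networks stand in for CLIP's frozen image and text encoders, and only the shared context vectors prepended to the frozen class-name embeddings receive gradients from the cross-entropy loss. All module and variable names are illustrative assumptions, not the paper's official implementation.

```python
# A toy sketch of Context Optimization (CoOp), unified-context variant.
# Frozen stand-in networks replace CLIP's pre-trained encoders; only the
# shared context vectors `ctx` are optimized.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n_classes, n_ctx, emb_dim, feat_dim = 5, 4, 32, 64

# Frozen stand-ins for the pre-trained encoders (CLIP's are Transformers).
text_encoder = nn.Sequential(nn.Flatten(1), nn.Linear((n_ctx + 1) * emb_dim, feat_dim))
image_encoder = nn.Sequential(nn.Flatten(1), nn.Linear(3 * 8 * 8, feat_dim))
for p in list(text_encoder.parameters()) + list(image_encoder.parameters()):
    p.requires_grad = False

# Frozen word embeddings of the class-name tokens (one token per class here).
class_embeddings = torch.randn(n_classes, 1, emb_dim)

# The only learnable parameters: shared context vectors [V]_1 ... [V]_M.
ctx = nn.Parameter(torch.zeros(n_ctx, emb_dim))
nn.init.normal_(ctx, std=0.02)
optimizer = torch.optim.SGD([ctx], lr=0.002)

def prompt_features():
    # Prepend the shared context to every class name: "[V]_1 ... [V]_M [CLASS]".
    prompts = torch.cat([ctx.unsqueeze(0).expand(n_classes, -1, -1), class_embeddings], dim=1)
    return F.normalize(text_encoder(prompts), dim=-1)

# One training step on a hypothetical few-shot batch (random data here).
images = torch.randn(8, 3, 8, 8)
labels = torch.randint(0, n_classes, (8,))

optimizer.zero_grad()
image_feats = F.normalize(image_encoder(images), dim=-1)
logits = 100.0 * image_feats @ prompt_features().t()  # cosine-similarity logits
loss = F.cross_entropy(logits, labels)
loss.backward()   # gradients flow through the frozen text encoder into ctx
optimizer.step()
```

In the class-specific variant, the context tensor would instead have one set of vectors per class (shape `(n_classes, n_ctx, emb_dim)`), giving each class its own learnable context.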
Experiments: We benchmark CoOp on 11 datasets, covering a diverse set of visual recognition tasks including classification of generic objects, scenes, actions, and fine-grained categories.