Learning to Prompt for Vision-Language Models
Kaiyang Zhou, Jingkang Yang, Chen Change Loy, Ziwei Liu
Abstract: Large pre-trained vision-language models like CLIP have shown great potential in learning representations that are transferable across a wide range of downstream tasks. Unlike traditional representation learning based on discretized labels, vision-language pre-training aligns images and texts in a common feature space, enabling zero-shot transfer to downstream tasks via prompting. This work shows that prompt engineering is a major challenge for deploying such models in practice, as it requires domain expertise and time-consuming word tuning. Inspired by recent advances in natural language processing (NLP) prompt learning, we propose Context Optimization (CoOp), a simple approach for adapting CLIP-like models to downstream image recognition. CoOp models a prompt's context words with learnable vectors while keeping the pre-trained parameters fixed. Two versions are provided: unified context and class-specific context. Extensive experiments on 11 datasets show that CoOp requires as few as one or two shots to beat hand-crafted prompts by a decent margin, and achieves significant improvements with more shots. Despite being a learning-based approach, CoOp achieves superb domain generalization compared with zero-shot models using hand-crafted prompts.
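To make the zero-shot transfer described above concrete, the sketch below synthesizes classification weights from a hand-crafted prompt template and compares them with an image embedding via cosine similarity. It assumes the open-source OpenAI CLIP package; the image file and class list are hypothetical placeholders.

```python
# A minimal sketch of zero-shot classification via prompting, assuming the
# open-source OpenAI CLIP package (https://github.com/openai/CLIP).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Classification weights are synthesized from a hand-crafted prompt template.
class_names = ["airplane", "dog", "pizza"]  # hypothetical downstream classes
prompts = clip.tokenize([f"a photo of a {c}." for c in class_names]).to(device)

# Hypothetical input image.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(prompts)
    # Cosine similarity between image and text embeddings gives the logits.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_features @ text_features.t()
    probs = logits.softmax(dim=-1)
```

The resulting probabilities rank the class names for the input image; CoOp replaces the hand-written template "a photo of a {}." with learned context vectors.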
Introduction: A common approach for building state-of-the-art visual recognition systems is to train vision models to predict a fixed set of object categories using discrete labels. However, this approach limits visual recognition systems to closed-set visual concepts. Recently, vision-language pre-training such as CLIP and ALIGN has emerged as a promising alternative for visual representation learning. The main idea is to align images and raw texts using two separate encoders. For any new classification task, classification weights can be synthesized by feeding sentences that describe the task-relevant categories to the text encoder. For pre-trained vision-language models, this text input, known as the prompt, plays a key role in downstream performance. However, identifying the right prompt is a non-trivial task, which often takes a significant amount of time for word tuning, as a slight change in wording can make a huge difference in performance. Inspired by recent advances in prompt learning research in NLP, we propose Context Optimization (CoOp), a simple approach for automating prompt engineering specifically for pre-trained vision-language models. CoOp models a prompt's context words with learnable vectors, and two implementations are provided: unified context and class-specific context. During training, we minimize prediction errors using the cross-entropy loss with respect to the learnable context vectors while keeping the entire set of pre-trained parameters fixed. The gradients can be back-propagated all the way through the text encoder, distilling the rich knowledge encoded in its parameters for learning task-relevant context.
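The following is a minimal, self-contained PyTorch sketch of the unified-context idea. Small random networks stand in for CLIP's frozen image and text encoders, and only the shared context vectors prepended to the frozen class-name embeddings receive gradients from the cross-entropy loss. All module and variable names are illustrative assumptions, not the paper's official implementation.

```python
# A toy sketch of Context Optimization (CoOp), unified-context variant.
# Frozen stand-in networks replace CLIP's pre-trained encoders; only the
# shared context vectors `ctx` are optimized.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n_classes, n_ctx, emb_dim, feat_dim = 5, 4, 32, 64

# Frozen stand-ins for the pre-trained encoders (CLIP's are Transformers).
text_encoder = nn.Sequential(nn.Flatten(1), nn.Linear((n_ctx + 1) * emb_dim, feat_dim))
image_encoder = nn.Sequential(nn.Flatten(1), nn.Linear(3 * 8 * 8, feat_dim))
for p in list(text_encoder.parameters()) + list(image_encoder.parameters()):
    p.requires_grad = False

# Frozen word embeddings of the class-name tokens (one token per class here).
class_embeddings = torch.randn(n_classes, 1, emb_dim)

# The only learnable parameters: shared context vectors [V]_1 ... [V]_M.
ctx = nn.Parameter(torch.zeros(n_ctx, emb_dim))
nn.init.normal_(ctx, std=0.02)
optimizer = torch.optim.SGD([ctx], lr=0.002)

def prompt_features():
    # Prepend the shared context to every class name: "[V]_1 ... [V]_M [CLASS]".
    prompts = torch.cat([ctx.unsqueeze(0).expand(n_classes, -1, -1), class_embeddings], dim=1)
    return F.normalize(text_encoder(prompts), dim=-1)

# One training step on a hypothetical few-shot batch (random data here).
images = torch.randn(8, 3, 8, 8)
labels = torch.randint(0, n_classes, (8,))

optimizer.zero_grad()
image_feats = F.normalize(image_encoder(images), dim=-1)
logits = 100.0 * image_feats @ prompt_features().t()  # cosine-similarity logits
loss = F.cross_entropy(logits, labels)
loss.backward()   # gradients flow through the frozen text encoder into ctx
optimizer.step()
```

In the class-specific variant, the context tensor would instead have one set of vectors per class (shape `(n_classes, n_ctx, emb_dim)`), giving each class its own learnable context.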
Experiments: We benchmark CoOp on 11 datasets, covering a diverse set of visual recognition tasks including classification of generic objects, scenes, actions, and fine-grained categories.