Learning to Prompt with Text Only Supervision for Vision-Language Models


4 Jan 2024 | Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Muzammal Naseer, Luc Van Gool, Federico Tombari
The paper "Learning to Prompt with Text Only Supervision for Vision-Language Models" addresses the challenge of adapting foundational vision-language models like CLIP for downstream tasks while maintaining their generalization capabilities. The authors propose a novel approach called ProText, which learns prompts using only text data derived from large language models (LLMs). Unlike existing methods that require labeled images or generate class-specific prompts, ProText leverages LLMs to generate rich contextual knowledge, enabling zero-shot transfer to new classes and datasets. The method is designed to reduce the cost of LLM serving and prompt engineering by allowing prompts to be directly used with new classes and datasets. Extensive evaluations on four benchmarks show that ProText improves over prior methods, including those that use labeled images, demonstrating its effectiveness in improving CLIP's generalization.The paper "Learning to Prompt with Text Only Supervision for Vision-Language Models" addresses the challenge of adapting foundational vision-language models like CLIP for downstream tasks while maintaining their generalization capabilities. The authors propose a novel approach called ProText, which learns prompts using only text data derived from large language models (LLMs). Unlike existing methods that require labeled images or generate class-specific prompts, ProText leverages LLMs to generate rich contextual knowledge, enabling zero-shot transfer to new classes and datasets. The method is designed to reduce the cost of LLM serving and prompt engineering by allowing prompts to be directly used with new classes and datasets. Extensive evaluations on four benchmarks show that ProText improves over prior methods, including those that use labeled images, demonstrating its effectiveness in improving CLIP's generalization.