Learning to Prompt with Text Only Supervision for Vision-Language Models

4 Jan 2024 | Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Muzammal Naseer, Luc Van Gool, Federico Tombari
This paper introduces ProText, a novel method for learning prompts for Vision-Language Models (VLMs) using text-only supervision. Unlike previous methods that rely on labeled image data or class-specific prompts generated by Large Language Models (LLMs), ProText learns prompts from text data generated by LLMs. This approach allows zero-shot transfer of the learned prompts to new classes and datasets, reducing the need for labeled image data and LLM prompt engineering costs.

ProText uses a contextual mapping strategy to learn a mapping function that embeds rich contextual knowledge from LLM data within the prompts. This enables the prompts to be used directly with new classes and datasets, potentially reducing LLM serving and prompt engineering costs.

The method is evaluated on four benchmarks, showing improvements over prior prompt-ensembling works and competitiveness with methods that use labeled images. In the cross-dataset transfer setting, ProText achieves an average gain of +2.08% over CLIP and surpasses the previous best image-supervised method, MaPLe, by +0.93%. The method is also effective in domain generalization, base-to-novel class, and text-only supervised settings. ProText is implemented using a publicly available CLIP model, and the code is available at https://github.com/muzairkhattak/ProText.
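To make the contextual-mapping idea more concrete, below is a minimal PyTorch sketch, not the authors' implementation (the official code is in the linked repository). It assumes a frozen CLIP-like text tower that accepts token embeddings directly, and the names `PromptedTextEncoder`, `ToyTextEncoder`, and the normalized L1 objective are illustrative assumptions rather than the paper's exact design: learnable context vectors are prepended to the embeddings of a class-name template (e.g. "a photo of a {class}") and trained so that the resulting text feature maps onto the frozen feature of an LLM-generated description of the same class.

```python
# Sketch of text-only prompt learning via contextual mapping (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptedTextEncoder(nn.Module):
    """Wraps a frozen text encoder and prepends learnable prompt vectors."""

    def __init__(self, text_encoder, embed_dim=512, n_prompts=4):
        super().__init__()
        self.text_encoder = text_encoder  # frozen CLIP-like text tower (assumption)
        for p in self.text_encoder.parameters():
            p.requires_grad_(False)
        # Learnable context vectors shared across all classes and datasets.
        self.prompts = nn.Parameter(torch.randn(n_prompts, embed_dim) * 0.02)

    def forward(self, token_embeds):
        # token_embeds: (batch, seq_len, embed_dim) embeddings of the
        # class-name template. Prepend learned context, then encode.
        ctx = self.prompts.unsqueeze(0).expand(token_embeds.size(0), -1, -1)
        return self.text_encoder(torch.cat([ctx, token_embeds], dim=1))


def contextual_mapping_loss(prompted_encoder, frozen_encoder,
                            template_embeds, llm_desc_embeds):
    """One training step: map prompted template features onto the frozen
    features of LLM-generated class descriptions (loss choice is assumed)."""
    pred = prompted_encoder(template_embeds)          # gradients reach the prompts
    with torch.no_grad():
        target = frozen_encoder(llm_desc_embeds)      # frozen target features
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(target, dim=-1)
    return F.l1_loss(pred, target)


if __name__ == "__main__":
    # Toy stand-in for a frozen text tower: mean-pool + linear projection.
    class ToyTextEncoder(nn.Module):
        def __init__(self, dim=512):
            super().__init__()
            self.proj = nn.Linear(dim, dim)

        def forward(self, x):
            return self.proj(x.mean(dim=1))

    frozen = ToyTextEncoder()
    prompted = PromptedTextEncoder(frozen)
    templates = torch.randn(8, 10, 512)   # embeddings of "a photo of a {class}"
    llm_texts = torch.randn(8, 24, 512)   # embeddings of LLM descriptions
    loss = contextual_mapping_loss(prompted, frozen, templates, llm_texts)
    loss.backward()                        # only the prompt vectors get gradients
    print(loss.item())
```

Because only the shared context vectors are optimized while the text encoder stays frozen, the learned prompts are not tied to any particular class name, which is what allows them to be reused with new classes and datasets at inference time.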