PromptKD: Unsupervised Prompt Distillation for Vision-Language Models

13 Aug 2024 | Zheng Li, Xiang Li, Xinyi Fu, Xin Zhang, Weiqiang Wang, Shuo Chen, Jian Yang
**Abstract:** Prompt learning has emerged as a valuable technique for adapting vision-language models (VLMs) such as CLIP to specific domain tasks. Existing methods focus primarily on designing various forms of learnable prompts, neglecting the potential of prompts as effective distillers of knowledge from larger teacher models. This paper introduces an unsupervised domain prompt distillation framework that transfers knowledge from a large teacher model to a lightweight target model using unlabeled domain images. The framework consists of two stages: first, a large CLIP teacher model is pre-trained on domain few-shot labels, and its pre-computed text features are stored as class vectors; second, these class vectors are shared across the teacher and student image encoders to calculate predicted logits. The logits of both models are aligned via KL divergence, encouraging the student image encoder, through its learnable prompts, to produce probability distributions similar to the teacher's. This eliminates the need for labeled data and enables training on extensive unlabeled images within the domain. Extensive experiments on 11 diverse recognition datasets demonstrate the effectiveness of the method, which achieves state-of-the-art performance.

**Contributions:**
- First to perform unsupervised domain-specific prompt-driven knowledge distillation for CLIP using unlabeled domain data.
- Leverages CLIP's decoupled-modality property to reuse pre-stored text features without additional computation cost.
- Uses the teacher to generate soft labels on extensive unlabeled domain data, enabling student training without labeled images.
- Achieves significant improvements on 11 diverse recognition datasets.

**Methods:** The method proceeds in two stages: teacher pre-training and student prompt distillation. In the first stage, a large CLIP teacher model is pre-trained on domain few-shot labeled data; after pre-training, its text features are extracted and stored as class vectors. In the second stage, these class vectors are shared across the teacher and student image encoders to calculate predicted logits, and KL divergence aligns the two sets of logits, encouraging the student image encoder to produce probability distributions similar to the teacher's. At inference, the well-trained student image encoder is used together with the pre-stored text features. A minimal sketch of this pipeline is given below.
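The following is a minimal sketch of the pipeline under stated assumptions, not the authors' implementation: the openai `clip` package stands in for the prompt-tuned teacher, the class names are placeholders, and `student_encoder` (a lightweight image encoder with learnable prompts) and `projector` (mapping student features to the teacher's text-feature dimension) are hypothetical modules; the temperature is likewise an assumed hyperparameter.

```python
# Sketch of PromptKD-style distillation: cached teacher text features act as
# shared class vectors; KL divergence aligns student and teacher logits on
# unlabeled domain images. Assumptions noted in comments.
import torch
import torch.nn.functional as F
import clip  # openai CLIP package, used here as a stand-in teacher

device = "cuda" if torch.cuda.is_available() else "cpu"

# ---- Stage 1: teacher pre-training happens elsewhere; here we only ----
# ---- pre-compute and cache its text features as class vectors.      ----
teacher, _ = clip.load("ViT-L/14", device=device)            # large CLIP teacher
class_names = ["dog", "cat", "car"]                           # placeholder domain classes
tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
with torch.no_grad():
    class_vectors = F.normalize(teacher.encode_text(tokens).float(), dim=-1)
    # Cached once and reused by both teacher and student branches.

# ---- Stage 2: prompt distillation on unlabeled domain images. ----
# `student_encoder` and `projector` are hypothetical modules standing in for
# the paper's student branch; `images` is a batch of CLIP-preprocessed tensors.
def distill_step(images, student_encoder, projector, temperature=1.0):
    with torch.no_grad():
        t_img = F.normalize(teacher.encode_image(images).float(), dim=-1)
        t_logits = t_img @ class_vectors.t() / temperature    # teacher soft labels

    s_img = F.normalize(projector(student_encoder(images)), dim=-1)
    s_logits = s_img @ class_vectors.t() / temperature        # same shared class vectors

    # KL divergence pulls the student's distribution toward the teacher's.
    return F.kl_div(F.log_softmax(s_logits, dim=-1),
                    F.softmax(t_logits, dim=-1),
                    reduction="batchmean") * temperature ** 2

# Inference reuses the cached class vectors with the trained student only.
def predict(images, student_encoder, projector):
    with torch.no_grad():
        s_img = F.normalize(projector(student_encoder(images)), dim=-1)
        return (s_img @ class_vectors.t()).argmax(dim=-1)
```

Because the class vectors are computed once and cached, no text encoder runs during student training or inference, which is what the decoupled-modality reuse in the contributions refers to.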
**Experiments:** Extensive experiments on 11 datasets demonstrate the effectiveness of the method, which achieves state-of-the-art performance and outperforms previous methods on all datasets, showing strong generalization ability and significant gains on both base and novel classes. Ablation studies and comparisons with other methods further validate the proposed approach.