13 Aug 2024 | Zheng Li, Xiang Li, Xinyi Fu, Xin Zhang, Weiqiang Wang, Shuo Chen, Jian Yang
PromptKD is an unsupervised prompt distillation framework for vision-language models (VLMs) that transfers knowledge from a large teacher model to a lightweight student model using unlabeled domain data. The framework consists of two stages: teacher pre-training and student prompt distillation.

In the first stage, a large CLIP teacher model is pre-trained on domain-specific few-shot labeled data, and its text features are pre-computed and stored as class vectors. In the second stage, these class vectors are shared between the teacher and student image encoders to compute predicted logits. The student is then trained, through learnable prompts, to produce probability distributions similar to the teacher's, with KL divergence aligning the two sets of logits. Because the supervision comes entirely from the teacher, no labels are required, and the student can learn from a large amount of unlabeled domain data. The framework exploits the decoupled-modality property of CLIP to reuse the pre-stored text features without additional computation during both distillation and inference.

Extensive experiments on 11 datasets demonstrate the effectiveness of PromptKD, which achieves state-of-the-art performance and outperforms previous prompt learning approaches while incurring a lower inference cost. The results indicate that PromptKD effectively transfers knowledge from a large teacher to a lightweight student, enabling better performance on downstream tasks.
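To make the stage-2 objective concrete, below is a minimal PyTorch-style sketch of a single distillation step under the description above. The names (teacher_image_encoder, student_image_encoder, class_vectors, tau) are illustrative placeholders rather than the authors' actual code; in practice the lightweight student may also need a small projection layer so its image features match the dimension of the teacher's text features, and only the learnable prompt parameters (and any projector) would be optimized.

```python
# Minimal sketch of one unsupervised distillation step, assuming
# pre-computed, L2-normalized teacher class vectors of shape
# (num_classes, dim). All names here are hypothetical placeholders.
import torch
import torch.nn.functional as F

def distill_step(images, class_vectors, teacher_image_encoder,
                 student_image_encoder, tau=1.0):
    # Teacher logits: frozen teacher image features against the shared
    # pre-stored class vectors. No labels are used anywhere.
    with torch.no_grad():
        t_img = F.normalize(teacher_image_encoder(images), dim=-1)
        teacher_logits = t_img @ class_vectors.t() / tau

    # Student logits: student image features (conditioned on learnable
    # prompts inside the encoder) against the same class vectors.
    s_img = F.normalize(student_image_encoder(images), dim=-1)
    student_logits = s_img @ class_vectors.t() / tau

    # KL divergence pulls the student's predicted distribution toward
    # the teacher's, which is the only training signal in stage 2.
    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    ) * (tau ** 2)
    return loss
```

Because the class vectors are computed once after stage 1 and simply reused here (and again at inference), the text encoder never has to run during distillation, which is where the efficiency claim comes from.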