27 Mar 2024 | Adilbek Karmanov, Dayan Guan, Shijian Lu, Abdulmotaleb El Saddik, Eric Xing
This paper introduces TDA, a training-free dynamic adapter for efficient and effective test-time adaptation of vision-language models (VLMs). TDA uses a lightweight key-value cache that maintains a dynamic queue of test-sample features as keys and the corresponding few-shot pseudo labels as values. This enables gradual adaptation to test data through progressive pseudo-label refinement and is highly efficient because it requires no backpropagation. In addition, TDA introduces negative pseudo labeling, which mitigates the impact of noisy pseudo labels by assigning pseudo labels to certain negative classes when the model is uncertain about its prediction. Extensive experiments on two benchmarks demonstrate that TDA outperforms state-of-the-art methods in both accuracy and efficiency, reducing testing time on the ImageNet dataset from over 12 hours to 16 minutes.
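To make the cache mechanism concrete, below is a minimal sketch of the training-free key-value cache idea, written from the description above rather than from the authors' released code. The class name, the fixed per-class capacity, and the entropy-based eviction rule are our assumptions for illustration; the key point is that the cache adapts by replacing uncertain entries with more confident ones, with no gradient updates anywhere.

```python
import torch
import torch.nn.functional as F

class KVCache:
    """Per-class queues of test-sample features, keyed by pseudo label."""

    def __init__(self, num_classes: int, capacity: int = 3):
        self.num_classes = num_classes
        self.capacity = capacity
        # Each entry is an (entropy, feature) pair; lower entropy = cleaner.
        self.queues = {c: [] for c in range(num_classes)}

    def update(self, feature: torch.Tensor, probs: torch.Tensor) -> None:
        """Insert a test sample under its pseudo label, evicting the most
        uncertain entry once the queue is full (progressive refinement)."""
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum().item()
        label = int(probs.argmax())
        queue = self.queues[label]
        queue.append((entropy, feature))
        queue.sort(key=lambda pair: pair[0])  # most confident first
        del queue[self.capacity:]             # drop the noisiest entries

    def predict(self, feature: torch.Tensor) -> torch.Tensor:
        """Similarity-weighted vote over cached keys; no backpropagation."""
        logits = torch.zeros(self.num_classes)
        for label, queue in self.queues.items():
            for _, key in queue:
                logits[label] += F.cosine_similarity(feature, key, dim=0)
        return logits
```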
TDA is training-free and efficient, making it suitable for real-world applications where models must adapt quickly to new environments, and it is particularly effective under distribution shifts between training and test data. The method maintains a dynamic cache that accumulates knowledge from the stream of test samples, from which the model generates positive and negative predictions that are combined with the CLIP predictions to produce the final output. The positive cache collects high-quality few-shot pseudo labels, while the negative cache counteracts noisy pseudo labels by identifying class absence rather than presence. By combining the two caches, TDA achieves superior performance in both speed and accuracy.
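The sketch below illustrates how negative pseudo labels might be assigned and how the two cache predictions could be fused with the CLIP logits. The probability threshold and the fusion weights (pos_weight, neg_weight) are illustrative placeholders, not the paper's tuned values, and encode_image, clip_head, pos_cache, and neg_cache in the commented loop are hypothetical stand-ins for a frozen CLIP encoder, its zero-shot head, and two caches like the one sketched earlier.

```python
import torch

def negative_pseudo_labels(probs: torch.Tensor, thresh: float = 0.03) -> torch.Tensor:
    """Flag classes the model is fairly sure are absent (illustrative
    rule: predicted probability below a hypothetical threshold)."""
    return (probs < thresh).float()

def fuse_predictions(clip_logits: torch.Tensor,
                     pos_logits: torch.Tensor,
                     neg_logits: torch.Tensor,
                     pos_weight: float = 2.0,
                     neg_weight: float = 0.1) -> torch.Tensor:
    """Final prediction: CLIP evidence, plus positive-cache support,
    minus negative-cache evidence of class absence."""
    return clip_logits + pos_weight * pos_logits - neg_weight * neg_logits

# Illustrative streaming loop over the test set:
# for image in test_stream:
#     feat = encode_image(image)
#     probs = clip_head(feat).softmax(dim=-1)
#     pos_cache.update(feat, probs)
#     final = fuse_predictions(clip_head(feat),
#                              pos_cache.predict(feat),
#                              neg_cache.predict(feat))
```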
Compared with existing test-time adaptation methods such as TPT and DiffTPT, TDA is both more efficient and more effective: it cuts testing time substantially, improves accuracy across benchmarks, is more robust to noisy pseudo labels, and generalizes better to the test data. Extensive experiments on two benchmarks confirm that TDA outperforms state-of-the-art test-time adaptation methods while significantly reducing testing time. This work advances both the research and practical value of test-time adaptation and offers a promising solution to the efficiency bottleneck in adapting vision-language models at test time.