26 Mar 2024 | Yabin Zhang, Wenjie Zhu, Hui Tang, Zhiyuan Ma, Kaiyang Zhou, Lei Zhang
Dual Memory Networks (DMN) is a versatile adaptation approach for vision-language models (VLMs) that handles three task settings: zero-shot, few-shot, and training-free few-shot adaptation. The method introduces dynamic and static memory networks to store and retrieve knowledge from historical test samples and labeled training data, respectively. The dynamic memory network accumulates features of historical test samples during testing, allowing the model to exploit information beyond the training set. The static memory network caches knowledge from the few-shot training data, enabling training-free few-shot adaptation. Both memories use a flexible memory interaction strategy that can operate in a training-free mode and can be further enhanced with learnable projection layers. Evaluated on 11 datasets, DMN outperforms existing zero-shot methods by over 3%, even surpassing methods that rely on external training data, and it remains robust under natural distribution shifts. The method is efficient: the training-free setting requires no learnable parameters, and inference remains fast. Across the few-shot and training-free few-shot settings, DMN likewise delivers superior performance, and its combined use of historical test samples and labeled training data establishes a new state of the art in vision-language model adaptation.
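To make the memory mechanism concrete, here is a minimal, training-free sketch (assuming a PyTorch setting): cached image features serve as keys and soft label vectors as values, a cosine-similarity attention readout retrieves a label distribution, and the result is fused with zero-shot text logits. The class names, the temperature `beta`, and the fusion weights are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

class MemoryNetwork:
    """Illustrative key-value feature memory (not the authors' code).
    Keys are normalized image features; values are (soft) label vectors."""

    def __init__(self, num_classes: int):
        self.num_classes = num_classes
        self.keys = []    # cached image features, one (D,) tensor per entry
        self.values = []  # matching label distributions, one (C,) tensor per entry

    def write(self, feature: torch.Tensor, label_probs: torch.Tensor) -> None:
        # Dynamic memory: store the feature of a historical test sample
        # together with its predicted label distribution.
        # Static memory: store training features with their one-hot labels.
        self.keys.append(F.normalize(feature, dim=-1))
        self.values.append(label_probs)

    def read(self, query: torch.Tensor, beta: float = 5.0) -> torch.Tensor:
        # Training-free readout: cosine-similarity attention over cached keys,
        # returning an attention-weighted sum of the stored label vectors.
        if not self.keys:
            return torch.zeros(self.num_classes)
        K = torch.stack(self.keys)                  # (N, D)
        V = torch.stack(self.values)                # (N, C)
        q = F.normalize(query, dim=-1)              # (D,)
        attn = torch.softmax(beta * (K @ q), dim=0) # (N,)
        return attn @ V                             # (C,)


def classify(image_feat, text_feats, dyn_mem, static_mem, w_dyn=0.5, w_static=0.5):
    """Fuse zero-shot text logits with dynamic and static memory readouts.
    The fusion weights w_dyn and w_static are hypothetical hyperparameters."""
    zero_shot = F.normalize(image_feat, dim=-1) @ F.normalize(text_feats, dim=-1).T
    return zero_shot + w_dyn * dyn_mem.read(image_feat) + w_static * static_mem.read(image_feat)
```

In this sketch, the zero-shot setting uses only the dynamic memory filled during testing, while the training-free few-shot setting adds a static memory populated from the labeled support set; the learnable variant described in the paper would additionally project the query and keys before the attention readout.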