CLIP-Adapter: Better Vision-Language Models with Feature Adapters

9 Oct 2021 | Peng Gao*1, Shijie Geng*2, Renrui Zhang*1, Teli Ma1, Rongyao Fang3, Yongfeng Zhang2, Hongsheng Li3, Yu Qiao1
CLIP-Adapter is a method for improving vision-language models through feature adapters rather than prompt tuning. Unlike traditional visual systems that rely on fixed labels, CLIP-Adapter aligns images with raw texts in an open-vocabulary setting. It uses a bottleneck layer to learn new features and performs residual-style blending of the new and pre-trained features, allowing it to outperform context optimization while maintaining a simple design.

CLIP-Adapter introduces lightweight feature adapters that fine-tune either the visual or the language branch, reducing the number of trainable parameters and preventing overfitting. The residual connections dynamically blend the original and newly learned features, enhancing performance. Experiments on various visual classification tasks show that CLIP-Adapter outperforms prompt tuning methods such as CoOp across different few-shot settings and datasets, with ablation studies confirming that it learns better feature manifolds. CLIP-Adapter is thus a promising alternative to prompt tuning for vision-language models.
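To make the adapter design concrete, below is a minimal sketch in PyTorch of a bottleneck adapter with residual feature blending, assuming pre-extracted CLIP features. The module name `CLIPAdapter`, the layer sizes, the reduction factor, and the residual ratio `alpha` are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CLIPAdapter(nn.Module):
    """Bottleneck MLP that refines CLIP features and blends them
    with the original (frozen) features via a residual ratio."""

    def __init__(self, dim: int = 1024, reduction: int = 4, alpha: float = 0.2):
        super().__init__()
        self.alpha = alpha  # weight given to the newly learned features
        self.bottleneck = nn.Sequential(
            nn.Linear(dim, dim // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim, bias=False),
            nn.ReLU(inplace=True),
        )

    def forward(self, clip_features: torch.Tensor) -> torch.Tensor:
        adapted = self.bottleneck(clip_features)
        # Residual-style blending of new and pre-trained features.
        blended = self.alpha * adapted + (1.0 - self.alpha) * clip_features
        return F.normalize(blended, dim=-1)


# Usage sketch: classify by cosine similarity against frozen text features
# (here random tensors stand in for CLIP encoder outputs).
if __name__ == "__main__":
    image_features = F.normalize(torch.randn(8, 1024), dim=-1)   # from CLIP image encoder
    text_features = F.normalize(torch.randn(100, 1024), dim=-1)  # one per class prompt
    adapter = CLIPAdapter(dim=1024)
    logits = 100.0 * adapter(image_features) @ text_features.t()
    print(logits.shape)  # torch.Size([8, 100])
```

In this sketch only the adapter's two linear layers are trained, while the CLIP encoders stay frozen, which is what keeps the parameter count small and limits overfitting in few-shot settings.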