CLIP-Adapter: Better Vision-Language Models with Feature Adapters


9 Oct 2021 | Peng Gao*1, Shijie Geng*2, Renrui Zhang*1, Teli Ma1, Rongyao Fang3, Yongfeng Zhang2, Hongsheng Li3, Yu Qiao1
The paper introduces CLIP-Adapter, a method for improving vision-language models through fine-tuning with lightweight feature adapters rather than prompt tuning. Unlike prompt tuning, which requires careful prompt engineering, CLIP-Adapter attaches a small adapter to either the visual or the language branch of a frozen CLIP model. The adapter learns new features and blends them with the original pre-trained features through residual connections, which mitigates overfitting in few-shot settings and improves performance. Experiments on a range of visual classification tasks show that CLIP-Adapter outperforms context optimization (CoOp) while keeping a simpler design, and extensive ablation studies validate the effectiveness of the approach.
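To make the adapter-plus-residual idea concrete, below is a minimal PyTorch sketch of a CLIP-Adapter-style module applied to image features. The bottleneck dimension, the residual ratio value, and the variable names are illustrative assumptions, not the authors' exact implementation; the structure follows the paper's description of a small two-layer MLP whose output is mixed with the frozen CLIP feature.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Adapter(nn.Module):
    """Bottleneck MLP adapter with residual blending of adapted and original features.

    Hyperparameters (reduction, ratio) are illustrative assumptions.
    """

    def __init__(self, dim: int = 1024, reduction: int = 4, ratio: float = 0.2):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.ReLU(inplace=True),
        )
        # Residual ratio: how much of the newly learned feature to mix in.
        self.ratio = ratio

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        adapted = self.fc(feat)
        # Keep most of the frozen CLIP feature to avoid overfitting on few-shot data.
        return self.ratio * adapted + (1.0 - self.ratio) * feat


def classify(image_feat: torch.Tensor, text_feat: torch.Tensor, adapter: Adapter) -> torch.Tensor:
    """Zero-/few-shot classification logits from adapted image features and class text features.

    image_feat: (batch, dim) frozen CLIP image features
    text_feat:  (num_classes, dim) frozen CLIP text features for the class prompts
    """
    image_feat = adapter(image_feat)
    image_feat = F.normalize(image_feat, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    # Cosine similarity scaled by a temperature-like factor (100 is a common choice).
    return 100.0 * image_feat @ text_feat.t()
```

In this sketch only the adapter's parameters would be trained, while the CLIP image and text encoders stay frozen, which is the design choice that keeps the method lightweight.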