The paper introduces CLIP-Adapter, a method for improving vision-language models through fine-tuning with feature adapters rather than prompt tuning. Unlike prompt tuning, which requires careful prompt engineering, CLIP-Adapter attaches lightweight feature adapters to fine-tune either the visual or the language branch of the model. The adapters learn new features and blend them with the original pre-trained features via residual connections, which prevents overfitting and improves performance. Experiments on a range of visual classification tasks show that CLIP-Adapter outperforms context optimization (prompt tuning) while maintaining a simpler design. The paper also includes extensive ablation studies validating the effectiveness of the proposed approach.
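To make the adapter-with-residual idea concrete, the following PyTorch sketch shows a bottleneck adapter applied to frozen CLIP image features. The feature dimension, reduction ratio, and residual weight `alpha` are illustrative assumptions rather than the paper's exact settings; the classification step with cosine similarity against text embeddings mirrors standard CLIP zero-shot inference.

```python
import torch
import torch.nn as nn


class CLIPAdapter(nn.Module):
    """Bottleneck MLP adapter with residual blending (sketch of the CLIP-Adapter idea).

    The adapter transforms a frozen CLIP feature and mixes the result back
    with the original feature using a residual ratio alpha.
    """

    def __init__(self, dim: int = 1024, reduction: int = 4, alpha: float = 0.2):
        super().__init__()
        self.alpha = alpha  # residual ratio: weight of the adapted feature (assumed value)
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim, bias=False),
            nn.ReLU(inplace=True),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        adapted = self.fc(feat)
        # Residual blend keeps pre-trained knowledge and helps avoid overfitting
        return self.alpha * adapted + (1.0 - self.alpha) * feat


if __name__ == "__main__":
    # Placeholders standing in for frozen CLIP image and text features
    image_features = torch.randn(8, 1024)
    text_features = torch.randn(10, 1024)

    adapter = CLIPAdapter(dim=1024, reduction=4, alpha=0.2)
    feats = adapter(image_features)

    # Cosine-similarity classification against the class text embeddings
    feats = feats / feats.norm(dim=-1, keepdim=True)
    text = text_features / text_features.norm(dim=-1, keepdim=True)
    logits = 100.0 * feats @ text.t()
    print(logits.shape)  # torch.Size([8, 10])
```

In this sketch only the adapter's parameters would be trained, while the CLIP backbone that produces the image and text features stays frozen, which is what keeps the fine-tuning lightweight.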