Low-Rank Few-Shot Adaptation of Vision-Language Models

1 Jun 2024 | Maxime Zanella*, Ismail Ben Ayed
This paper introduces CLIP-LoRA, a low-rank adaptation method for few-shot learning with Vision-Language Models (VLMs). The authors show that CLIP-LoRA outperforms existing prompt- and adapter-based approaches on 11 datasets while reducing training time and keeping the same hyperparameters across all tasks. The method attaches low-rank matrices of rank 2 to the query, key, and value projections of both the vision and text encoders. It is more efficient than adapters and prompt tuning, requires fewer computational resources, and is more practical because it needs no per-dataset hyperparameter tuning.

The paper also discusses design considerations for applying LoRA to VLMs: which encoders to adapt, which weight matrices to target, and what rank to use. The results show that adapting both encoders yields the best performance on average, and that adapting the value or output matrices is the most effective strategy. The authors conclude that LoRA is a promising approach for few-shot adaptation of VLMs, and that further research is needed to identify the optimal design choices for this method.
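To make the mechanism concrete, the following is a minimal numpy sketch of the standard LoRA update that CLIP-LoRA applies to the attention projections: a frozen pretrained weight W is augmented with a trainable rank-r product A·B (here r = 2, as in the paper), scaled by alpha/r. The class and parameter names are illustrative assumptions, not the authors' actual code.

```python
import numpy as np

class LoRALinear:
    """Hypothetical sketch of a low-rank-adapted linear projection.

    The pretrained weight W stays frozen; only the low-rank factors
    A (d_out x r) and B (r x d_in) would be trained. B starts at zero,
    so the adapted layer initially matches the frozen one exactly.
    """

    def __init__(self, W, r=2, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                   # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, (d_out, r))   # trainable factor, small random init
        self.B = np.zeros((r, d_in))                 # trainable factor, zero init
        self.scale = alpha / r                       # standard LoRA scaling

    def __call__(self, x):
        # Effective weight: W + (alpha / r) * A @ B
        return x @ (self.W + self.scale * self.A @ self.B).T
```

In CLIP-LoRA, a layer like this would replace the query, key, and value projections in both encoders; at initialization the zero-initialized B guarantees the model's outputs are unchanged, and only A and B receive gradients during few-shot training.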