Low-Rank Few-Shot Adaptation of Vision-Language Models

1 Jun 2024 | Maxime Zanella*, Ismail Ben Ayed
This paper introduces CLIP-LoRA, a low-rank adaptation method for few-shot learning with Vision-Language Models (VLMs). The authors show that CLIP-LoRA outperforms existing prompt- and adapter-based approaches on 11 datasets while reducing training time and keeping the same hyperparameters across all tasks. The method attaches low-rank matrices of rank 2 to the query, key, and value projections of both the vision and text encoders. It is more efficient than adapters and prompt tuning, requires fewer computational resources, and is more practical because it needs no per-dataset hyperparameter tuning.

The paper also discusses design considerations for applying LoRA to VLMs: which encoders to adapt, which weight matrices to target, and what rank to use. The results show that adapting both encoders yields the best performance on average, and that adapting the value or output matrices is the most effective strategy. The authors conclude that LoRA is a promising approach for few-shot adaptation of VLMs, and that further research is needed to identify the optimal design choices for this method.
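To make the mechanism concrete, the following is a minimal numpy sketch of the standard LoRA update that CLIP-LoRA applies to the attention projections: a frozen pretrained weight W is augmented with a trainable rank-r product A·B (here r = 2, as in the paper), scaled by alpha/r. The class and parameter names are illustrative assumptions, not the authors' actual code.

```python
import numpy as np

class LoRALinear:
    """Hypothetical sketch of a low-rank-adapted linear projection.

    The pretrained weight W stays frozen; only the low-rank factors
    A (d_out x r) and B (r x d_in) would be trained. B starts at zero,
    so the adapted layer initially matches the frozen one exactly.
    """

    def __init__(self, W, r=2, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                   # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, (d_out, r))   # trainable factor, small random init
        self.B = np.zeros((r, d_in))                 # trainable factor, zero init
        self.scale = alpha / r                       # standard LoRA scaling

    def __call__(self, x):
        # Effective weight: W + (alpha / r) * A @ B
        return x @ (self.W + self.scale * self.A @ self.B).T
```

In CLIP-LoRA, a layer like this would replace the query, key, and value projections in both encoders; at initialization the zero-initialized B guarantees the model's outputs are unchanged, and only A and B receive gradients during few-shot training.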