LoRA Learns Less and Forgets Less


15 May 2024 | Dan Biderman, Jose Gonzalez Ortiz, Jacob Portes, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, Cody Blakeney, John P. Cunningham
This paper evaluates LoRA (Low-Rank Adaptation) against full finetuning on two target domains: programming and mathematics. LoRA is a parameter-efficient finetuning method that trains low-rank perturbations to selected weight matrices instead of updating all parameters, substantially reducing memory usage. The comparison spans two training regimes: instruction finetuning on question-answer datasets and continued pretraining on large unlabeled corpora.

In both domains and both regimes, LoRA substantially underperforms full finetuning on target-domain accuracy. At the same time, LoRA acts as a stronger regularizer: it better preserves the base model's performance on tasks outside the target domain (it forgets less of the source domain) and maintains more diverse generations. This regularization effect is stronger than that of common techniques such as weight decay and dropout.

The paper also finds that full finetuning learns weight perturbations of much higher rank than typical LoRA configurations, which may explain part of the performance gap. It concludes with best practices for training with LoRA, which is more sensitive to hyperparameters than full finetuning, particularly the learning rate, the choice of target modules, and the rank. Overall, LoRA trades some target-domain accuracy for better regularization and more diverse outputs, making it a useful tool when preserving base-model capabilities matters.
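To make the method concrete, here is a minimal sketch of the LoRA parameterization in PyTorch. It illustrates the general technique, not the paper's training code; the rank and alpha values are arbitrary placeholders.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update:
    y = x W^T + (alpha / r) * x (B A)^T, where A is r x d_in and B is d_out x r."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # the pretrained weights stay fixed
        d_out, d_in = base.weight.shape
        # B starts at zero, so the perturbation is zero at initialization and
        # the wrapped layer initially matches the base model exactly.
        self.lora_A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d_out, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
```

Only lora_A and lora_B receive gradients, which is where the memory savings come from: optimizer state is kept for a small fraction of the parameters.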
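The rank comparison between full finetuning and LoRA can be probed by taking the SVD of the weight difference W_finetuned - W_base. Below is one hedged way to estimate an effective rank; the 90% energy threshold is an illustrative assumption, not necessarily the paper's exact metric.

```python
import torch

def effective_rank(w_base: torch.Tensor, w_tuned: torch.Tensor,
                   energy: float = 0.90) -> int:
    """Smallest number of singular values of (w_tuned - w_base) whose
    squared sum captures `energy` of the perturbation's total energy."""
    delta = (w_tuned - w_base).float()
    s = torch.linalg.svdvals(delta)              # singular values, descending
    cum = torch.cumsum(s**2, dim=0) / torch.sum(s**2)
    return int((cum < energy).sum().item()) + 1
```

A full-finetuning perturbation with a high effective rank cannot be represented by a low-rank LoRA update, which is consistent with the observed accuracy gap.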
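For the best-practice hyperparameters, a typical setup using the Hugging Face peft library looks like the following. The specific model name, rank, alpha, and dropout are illustrative assumptions, not values prescribed by the paper; the point is that learning rate, target modules, and rank are the settings to tune.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # example base model

config = LoraConfig(
    r=16,                          # rank of the low-rank update
    lora_alpha=32,                 # scaling factor (effective scale = alpha / r)
    lora_dropout=0.05,
    # Which matrices receive LoRA updates; one of the hyperparameters the
    # paper flags as important. Module names here follow Llama-style models.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the LoRA matrices are trainable
```

Because LoRA is more sensitive to these settings than full finetuning, a small learning-rate sweep is worth its cost before committing to a long run.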