**LoRA-GA: Low-Rank Adaptation with Gradient Approximation**
**Authors:** Shaowen Wang, Linxi Yu, Jian Li
**Affiliation:** Tsinghua University, Beijing, China
**Abstract:**
Fine-tuning large-scale pre-trained models is computationally and memory-intensive. LoRA, a popular Parameter-Efficient Fine-Tuning (PEFT) method, reduces these costs by fine-tuning an auxiliary low-rank model. However, LoRA converges more slowly than full fine-tuning, leading to increased compute costs and potentially worse final performance. This paper investigates LoRA's initialization and introduces LoRA-GA, a novel initialization technique that aligns the gradient of the low-rank matrix product with that of full fine-tuning. Extensive experiments show that LoRA-GA converges at a rate comparable to full fine-tuning while improving performance. On the GLUE benchmark with T5-Base, LoRA-GA outperforms LoRA by 5.69% on average. On larger models such as Llama 2-7B, it improves over LoRA by 0.34 points on MT-bench, 11.52% on GSM8K, and 5.05% on HumanEval. LoRA-GA also converges up to 2-4 times faster than vanilla LoRA.
**Contributions:**
1. We propose LoRA-GA, a novel initialization method for LoRA that accelerates convergence by approximating the gradients of the low-rank matrices with those of the full weight matrix; the first-step calculation behind this idea is sketched after this list.
2. We identify the scaling factor that ensures variance stability under non-zero initialization.
3. We validate LoRA-GA through extensive experiments, demonstrating significant performance improvements and faster convergence compared to vanilla LoRA.
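To make the gradient-approximation idea in contribution 1 concrete, the following is the first-step calculation it rests on; this is a sketch that omits LoRA's $\alpha/r$ scaling, and the exact singular-vector assignment in the paper may differ. With the adapted layer $y = (W + BA)x$ and $W$ frozen, the chain rule gives $\partial L/\partial A = B^{\top} G$ and $\partial L/\partial B = G A^{\top}$, where $G = \partial L/\partial W$ is the full fine-tuning gradient. A single gradient step of size $\eta$ therefore changes the product by

$$\Delta(BA) \approx -\eta \left( \frac{\partial L}{\partial B}\, A + B\, \frac{\partial L}{\partial A} \right) = -\eta \left( G A^{\top} A + B B^{\top} G \right).$$

Taking the SVD $G = U S V^{\top}$ and initializing $B$ and $A$ from disjoint sets of singular vectors, e.g. $B_{\text{init}} = U_{[:,\,1:r]}$ and $A_{\text{init}} = V_{[:,\,r+1:2r]}^{\top}$, makes $G A^{\top} A + B B^{\top} G$ the best rank-$2r$ approximation of $G$ (by Eckart–Young), so the first low-rank update points approximately in the direction a full fine-tuning step would take.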
**Methods:**
LoRA-GA consists of two key components: gradient approximation and stable-scale initialization. We initialize $A_{\text{init}}$ and $B_{\text{init}}$ with singular vectors from the SVD of the full gradient matrix, ensuring that the first-step update $\Delta(\eta BA)$ approximates the direction of the full fine-tuning update $\Delta W$. We then determine the scaling factor $\zeta$ that ensures rank and scale stability.
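A minimal PyTorch sketch of this initialization for a single linear layer is given below, assuming the weight has shape `(d_out, d_in)` and `min(d_out, d_in) >= 2r`. The function name `lora_ga_init`, the exact singular-vector slicing, and the omission of the stable-scale factor $\zeta$ are illustrative assumptions, not the paper's exact recipe:

```python
import torch

@torch.no_grad()
def lora_ga_init(weight: torch.Tensor, grad: torch.Tensor, r: int):
    """Hypothetical sketch of a LoRA-GA-style initialization.

    weight: frozen weight W of shape (d_out, d_in)
    grad:   full-parameter gradient dL/dW (same shape as W), estimated
            on a few sampled batches before training
    r:      LoRA rank
    """
    # SVD of the full gradient; its singular vectors span the
    # directions a full fine-tuning step would move in.
    U, _, Vh = torch.linalg.svd(grad.float(), full_matrices=False)
    # Use disjoint singular vectors for B and A so their combined
    # first-step update covers the top-2r directions of the gradient.
    # (The exact slicing and the stable-scale factor zeta from the
    # paper are simplified here; this is an illustrative choice.)
    B_init = U[:, :r].contiguous()        # (d_out, r)
    A_init = Vh[r:2 * r, :].contiguous()  # (r, d_in)
    # A non-zero product B_init @ A_init would change the model's
    # initial output; absorb it into the frozen weight so the model's
    # function at step 0 is unchanged.
    weight_frozen = weight - (B_init @ A_init).to(weight.dtype)
    return weight_frozen, A_init, B_init
```

In practice, `grad` would be estimated by accumulating gradients of the frozen model over a few sampled batches before training, after which `A_init` and `B_init` are installed as the trainable low-rank factors.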
**Experiments:**
We evaluate LoRA-GA on a range of benchmarks, including the GLUE benchmark with T5-Base and, for large language models, Llama 2-7B. Results show that LoRA-GA consistently outperforms vanilla LoRA and other baseline methods, achieving performance comparable to full fine-tuning while reducing computational costs.
**Ablation Study:**
We conduct ablation studies to evaluate the contributions of non-zero initialization, stable-scale initialization, and gradient approximation in LoRA-GA. The results show that each component improves performance and convergence speed.
**Memory and Running Time:**
LoRA-GA does not require additional memory beyond what is used for training with LoRA.