FLORA: Low-Rank Adapters Are Secretly Gradient Compressors


2024 | Yongchang Hao, Yanshuai Cao, Lili Mou
This paper introduces FLORA, an optimization technique that achieves sublinear memory usage for gradient accumulation and momentum calculation. FLORA builds on the observation that low-rank adaptation (LoRA) can be approximated by a random projection that compresses the gradient into a lower-dimensional space. By resampling the projection matrices over time, FLORA enables high-rank updates while keeping the space complexity of the optimization states sublinear.

The paper presents a detailed analysis of LoRA's dynamics, showing that its update can be viewed as a random projection of the gradient, and then applies this projection idea to both gradient accumulation and momentum. FLORA's space complexity is comparable to LoRA's but with a smaller constant, leading to lower memory usage in practice.

Experiments across different tasks and model architectures show that FLORA, combined with Adafactor as the base optimizer, significantly reduces memory usage while achieving performance comparable to full-matrix updates and outperforming other compression approaches such as LoRA. The paper also discusses related work and the potential impact of FLORA on memory-efficient deep learning.
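To make the random-projection idea concrete, here is a minimal NumPy sketch of compressed gradient accumulation in the spirit of FLORA. The class and method names (`CompressedAccumulator`, `accumulate`, `resample`, `decompress`) are illustrative assumptions, not the paper's reference implementation, and details such as the projection scaling and resampling schedule may differ from the actual method.

```python
import numpy as np


def make_projection(n, r, seed):
    """Random down-projection with entries ~ N(0, 1/r).

    With this scaling, E[B @ B.T] = I_n, so G @ B @ B.T is a sketched
    approximation of the gradient G (Johnson-Lindenstrauss style).
    """
    rng = np.random.default_rng(seed)
    return rng.normal(scale=1.0 / np.sqrt(r), size=(n, r))


class CompressedAccumulator:
    """Accumulate gradients in an m x r buffer instead of m x n (r << n)."""

    def __init__(self, m, n, r, seed=0):
        self.n, self.r = n, r
        self.B = make_projection(n, r, seed)   # n x r random projection
        self.acc = np.zeros((m, r))            # compressed optimizer state

    def accumulate(self, grad):
        # Project each incoming m x n gradient down to m x r and add it.
        self.acc += grad @ self.B

    def resample(self, seed):
        # Periodically switch to a fresh projection so updates are not
        # confined to a single rank-r subspace. The old state is carried
        # over through an r x r transition matrix, so no m x n tensor is
        # ever materialized and memory stays sublinear.
        B_new = make_projection(self.n, self.r, seed)
        self.acc = self.acc @ (self.B.T @ B_new)
        self.B = B_new

    def decompress(self):
        # Approximately reconstruct the accumulated m x n gradient
        # for the weight update.
        return self.acc @ self.B.T
```

In a training loop, one would call `accumulate` on each micro-batch gradient, call `decompress` once at the update step to recover an approximate accumulated gradient, and call `resample` periodically so that successive updates span different low-rank subspaces.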