FLORA: Low-Rank Adapters Are Secretly Gradient Compressors


2024 | Yongchang Hao, Yanshuai Cao, Lili Mou
This paper introduces FLORA, a novel optimization technique that achieves sublinear memory usage for gradient accumulation and momentum. FLORA is inspired by the observation that low-rank adapters (LoRA) can be approximated by random projections. LoRA reduces the optimizer states by training fewer parameters, but it restricts weight updates to be low-rank, which limits model performance. FLORA resamples the projection matrices used in LoRA, enabling high-rank updates while maintaining sublinear space complexity.

The key contributions of this work are:

1. **Dynamics of LoRA**: LoRA updates are dominated by a random projection that compresses gradients into a lower-dimensional space.
2. **Random Projection of Gradients**: LoRA can be interpreted as performing random projection and decompression of the gradients.
3. **FLORA Method**: FLORA resamples the random projection matrices, allowing high-rank updates with sublinear memory usage for gradient accumulation and momentum calculation (see the sketch below).

Experiments across different tasks and model architectures (e.g., T5 and GPT-2) demonstrate that FLORA reduces memory usage while maintaining or improving performance compared to LoRA and other methods. FLORA is particularly effective for large models, achieving significant memory savings without sacrificing accuracy.
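The following is a minimal NumPy sketch of the gradient-compression idea summarized above: each full gradient is projected into a low-dimensional space by a random matrix, accumulated there, and decompressed at update time. The function name `accumulate_compressed`, the array shapes, the rank, and the use of a single fixed projection per accumulation window are illustrative assumptions; the actual FLORA method additionally resamples the projection over time and applies the same trick to momentum.

```python
import numpy as np

def accumulate_compressed(grads, rank, seed=0):
    """Sketch of gradient accumulation via random projection.

    Each full gradient G (m x n) is stored only as G @ A (m x r), where A
    is a random Gaussian projection (n x r), so the accumulator needs
    O(m * r) memory instead of O(m * n). Because E[A @ A.T] = I when the
    entries of A are N(0, 1/r), decompressing with A.T gives an unbiased
    estimate of the accumulated gradient.
    """
    rng = np.random.default_rng(seed)
    m, n = grads[0].shape
    A = rng.normal(scale=1.0 / np.sqrt(rank), size=(n, rank))  # random projection
    acc = np.zeros((m, rank))                                  # compressed accumulator
    for G in grads:
        acc += G @ A            # keep only the r-dimensional projection
    return acc @ A.T            # decompress: approximate sum of gradients

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    grads = [rng.normal(size=(64, 256)) for _ in range(8)]
    approx = accumulate_compressed(grads, rank=64)
    exact = sum(grads)
    # Relative error shrinks as the rank grows.
    print(np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```

Although each individual update reconstructed this way is low-rank, resampling the projection matrix across steps (as FLORA does) lets the accumulated weight change become high-rank while the memory cost stays sublinear in the layer size.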