GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

2024 | Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, Yuandong Tian
The paper introduces GaLore (Gradient Low-Rank Projection), a memory-efficient training strategy for Large Language Models (LLMs). Traditional low-rank adaptation methods such as LoRA reduce memory usage by approximating weight updates with a low-rank factorization, but they often fall short of full-rank training in both memory efficiency and final performance. GaLore instead exploits the low-rank structure that gradients naturally exhibit during training for certain gradient forms and network architectures. By projecting the gradient matrix into a low-rank subspace and keeping the optimizer states there, GaLore reduces optimizer memory usage by up to 65.5% while matching or improving performance on pre-training and fine-tuning tasks, including LLaMA 1B and 7B models pre-trained on the C4 dataset and RoBERTa fine-tuned on GLUE. Notably, GaLore enables pre-training a 7B model on a consumer GPU with 24GB of memory without model parallelism, activation checkpointing, or offloading strategies. The method is compatible with existing memory-efficient optimization techniques, such as 8-bit optimizers and per-layer weight updates, and can be combined with various learning algorithms. The paper also presents theoretical foundations, experimental results, and ablation studies validating the effectiveness and efficiency of GaLore.
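
The core idea can be sketched in a few lines. Below is a minimal, hypothetical PyTorch illustration (not the authors' released implementation): the gradient of a weight matrix is projected onto its top-r left singular vectors, Adam's moment estimates are kept only in that r-dimensional space, and the resulting update is projected back to full rank before being applied. Function and parameter names such as galore_adam_step and update_proj_gap are illustrative assumptions, and resetting the moments when the projector is refreshed is a simplification.

# Minimal sketch of a GaLore-style Adam step for one 2-D weight matrix.
# Optimizer states live in a (rank, n) space instead of (m, n), which is
# where the reported optimizer-memory savings come from.
import torch

def galore_adam_step(W, G, state, rank=128, lr=1e-3, betas=(0.9, 0.999),
                     eps=1e-8, update_proj_gap=200):
    """Hypothetical GaLore-style update. W, G: (m, n) weight and gradient;
    state: dict holding the projector and low-rank Adam moments."""
    step = state.get("step", 0)

    # Periodically refresh the projector P (m, rank) from the SVD of the gradient.
    if "P" not in state or step % update_proj_gap == 0:
        U, _, _ = torch.linalg.svd(G, full_matrices=False)
        state["P"] = U[:, :rank]                      # top-r left singular vectors
        state["m"] = torch.zeros(rank, G.shape[1], device=G.device, dtype=G.dtype)
        state["v"] = torch.zeros(rank, G.shape[1], device=G.device, dtype=G.dtype)

    P = state["P"]
    R = P.T @ G                                       # projected gradient: (rank, n)

    # Standard Adam moment updates, but on the low-rank projected gradient.
    beta1, beta2 = betas
    state["m"] = beta1 * state["m"] + (1 - beta1) * R
    state["v"] = beta2 * state["v"] + (1 - beta2) * R * R
    m_hat = state["m"] / (1 - beta1 ** (step + 1))
    v_hat = state["v"] / (1 - beta2 ** (step + 1))
    N = m_hat / (v_hat.sqrt() + eps)                  # normalized low-rank update

    # Project the update back to full rank and apply it to the weights.
    W -= lr * (P @ N)
    state["step"] = step + 1
    return W

In this sketch the Adam moments occupy rank x n entries rather than m x n, so for a large m and a small rank r the optimizer-state memory shrinks roughly by a factor of m / r, consistent with the memory reductions the paper reports.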