Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients

11 Jul 2024 | Zhenyu Zhang, Ajay Jaiswal, Lu Yin, Shiwei Liu, Jiawei Zhao, Yuandong Tian, Zhangyang Wang
Q-GaLore is a novel approach designed to reduce memory usage in training Large Language Models (LLMs) by combining quantization and low-rank projection. It addresses the limitations of existing methods like GaLore, which rely on time-consuming Singular Value Decomposition (SVD) operations and yield only minimal gains in accuracy and efficiency. Q-GaLore leverages two key observations: (1) the gradient subspace exhibits diverse properties, with some layers converging early while others change frequently; and (2) projection matrices are highly resilient to low-bit quantization. By adaptively updating the gradient subspace based on its convergence statistics and keeping projection matrices in INT4 format, Q-GaLore significantly reduces memory usage while achieving comparable performance. The method also uses stochastic rounding to capture accumulated gradient information, enabling high-precision training trajectories with low-precision weights. Experiments demonstrate that Q-GaLore achieves competitive pre-training and fine-tuning performance with exceptional memory efficiency, making it suitable for training LLMs on devices with limited memory, such as the NVIDIA RTX 4060 Ti.
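The mechanisms described above can be illustrated with a short PyTorch sketch: low-bit storage of the projection matrix, stochastic rounding so the low-precision weight trajectory is unbiased, and a per-layer check that skips the SVD refresh once the gradient subspace has converged. This is a minimal sketch under stated assumptions, not the paper's released implementation: the symmetric per-tensor quantization scheme, the INT4 range of [-8, 7], the cosine-similarity convergence test, and all function names here are illustrative.

```python
import torch
import torch.nn.functional as F

def quantize_int4(x: torch.Tensor):
    # Symmetric per-tensor quantization (an assumption; the paper may use
    # a different scheme). INT4 values span [-8, 7], stored here in int8.
    scale = x.abs().max().clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(x / scale), -8, 7).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover an approximate full-precision tensor from the INT4 codes.
    return q.to(torch.float32) * scale

def stochastic_round(x: torch.Tensor) -> torch.Tensor:
    # Round up with probability equal to the fractional part, so rounding
    # is unbiased: E[stochastic_round(x)] == x. Small gradient updates
    # therefore survive in expectation even on a coarse weight grid.
    floor = torch.floor(x)
    return floor + (torch.rand_like(x) < (x - floor)).to(x.dtype)

def subspace_converged(p_prev: torch.Tensor, p_curr: torch.Tensor,
                       threshold: float = 0.99) -> bool:
    # A hypothetical convergence statistic: if two consecutive projection
    # matrices for a layer are nearly parallel, skip the costly SVD
    # refresh for that layer and keep its current subspace.
    cos = F.cosine_similarity(p_prev.flatten(), p_curr.flatten(), dim=0)
    return cos.item() > threshold
```

Combining the pieces, a low-precision weight update along the lines of `w_q = stochastic_round((dequantize(w_q, s) - lr * grad) / s)` matches the exact high-precision update in expectation, which is the intuition behind maintaining a high-precision training trajectory with low-precision weights.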