Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients


11 Jul 2024 | Zhenyu Zhang, Ajay Jaiswal, Lu Yin, Shiwei Liu, Jiawei Zhao, Yuandong Tian, Zhangyang Wang
Q-GaLore is a novel method that reduces memory usage in training Large Language Models (LLMs) by combining quantization and low-rank projection. It improves upon GaLore, which uses Singular Value Decomposition (SVD) to project gradients into a low-rank subspace, by introducing INT4 projection matrices and layer-adaptive subspace updates. Q-GaLore reduces the number of SVD operations and leverages the resilience of projection matrices to low-bit quantization. The method keeps weights in INT8 format and uses stochastic rounding to preserve gradient information, enabling high-precision training dynamics with low-precision weights and achieving competitive performance in both pre-training and fine-tuning. Q-GaLore significantly reduces memory overhead, allowing a 7B LLaMA model to be trained on a single NVIDIA RTX 4060 Ti with only 16 GB of memory. It also reduces memory consumption by up to 50% compared to LoRA and GaLore during fine-tuning, while outperforming QLoRA by up to 5.19 points on the MMLU benchmark. The method demonstrates efficient memory usage and training stability, making it suitable for a wide range of hardware configurations.
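To make the mechanics concrete, below is a minimal NumPy sketch (not the authors' implementation) of the two ideas the summary highlights: projecting the gradient through a low-bit projection matrix derived from its SVD, and committing updates to INT8 weights with stochastic rounding. The rank, learning rate, helper names, and quantization details are illustrative assumptions only.

```python
import numpy as np

def quantize_symmetric(x, num_bits):
    """Symmetric uniform quantization to num_bits; returns de-quantized float values."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = np.max(np.abs(x))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

def stochastic_round(x):
    """Round up or down with probability given by the fractional part,
    so the rounding is unbiased in expectation and small updates are not lost."""
    floor = np.floor(x)
    frac = x - floor
    return floor + (np.random.rand(*x.shape) < frac)

# Toy example: project a full gradient into a rank-r subspace with a 4-bit-quantized
# projection matrix, take an update step there, project back, and commit the result
# to an INT8 weight copy via stochastic rounding.
rng = np.random.default_rng(0)
m, n, r = 64, 32, 4                                   # layer size and low-rank dimension (assumed)
weight_fp = rng.normal(size=(m, n)).astype(np.float32)
grad = rng.normal(size=(m, n)).astype(np.float32)

# Projection matrix from the gradient's top-r left singular vectors (as in GaLore);
# Q-GaLore additionally stores it in INT4 and refreshes it only when the subspace drifts.
U, _, _ = np.linalg.svd(grad, full_matrices=False)
P = quantize_symmetric(U[:, :r], num_bits=4)          # (m, r) low-bit projector

low_rank_grad = P.T @ grad                            # project gradient into the subspace: (r, n)
update = -0.01 * low_rank_grad                        # optimizer step in the subspace (toy SGD)
full_update = P @ update                              # project back to the full weight shape: (m, n)

# Maintain weights in INT8; stochastic rounding keeps tiny updates from vanishing.
w_scale = np.max(np.abs(weight_fp)) / 127.0
w_int8 = np.clip(stochastic_round((weight_fp + full_update) / w_scale), -128, 127).astype(np.int8)
print(w_int8.shape, w_int8.dtype)
```

In this sketch the subspace is recomputed from a single gradient; the actual method amortizes the SVD by monitoring how much the projection subspace changes per layer and updating it lazily, which is where the reported savings in SVD operations come from.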