2024 | Haotong Qin, Xudong Ma, Xingyu Zheng, Xiaoyang Li, Yang Zhang, Shouda Liu, Jie Luo, Xianglong Liu, Michele Magno
This paper introduces IR-QLoRA, a method for accurately quantizing LoRA-finetuned large language models (LLMs) through information retention. It addresses the significant information loss that occurs during quantization and limits the accuracy of quantized LLMs. IR-QLoRA incorporates two key techniques: Information Calibration Quantization (ICQ) and Information Elastic Connection (IEC). ICQ ensures that quantized parameters retain the original information by calibrating the quantization process via entropy maximization. IEC strengthens LoRA's information recovery capability by enabling elastic representation transformations with diverse information.
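To make the entropy-maximization idea behind ICQ more concrete, below is a minimal sketch of one way to realize entropy-based calibration for block-wise weight quantization. The correction factor `tau`, the search grid, and the uniform 4-bit level set are illustrative assumptions for this sketch, not the paper's exact formulation (IR-QLoRA builds on NF4-style non-uniform quantization).

```python
# Hedged sketch: entropy-maximizing calibration of a per-block quantization scale,
# in the spirit of ICQ. All names and constants here are illustrative assumptions.
import numpy as np

# A simple set of 4-bit quantization levels in [-1, 1] (uniform for simplicity;
# the paper works with NF4-style non-uniform levels).
LEVELS = np.linspace(-1.0, 1.0, 16)

def quantize_block(w, scale):
    """Map each weight to the index of its nearest level after dividing by `scale`."""
    return np.abs(w[:, None] / scale - LEVELS[None, :]).argmin(axis=1)

def entropy(idx, n_levels=16):
    """Shannon entropy (in bits) of the empirical distribution over quantized indices."""
    counts = np.bincount(idx, minlength=n_levels).astype(float)
    p = counts / counts.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def calibrate_block(w, grid=np.linspace(0.5, 1.5, 21)):
    """Search a multiplicative correction `tau` on the absmax scale that maximizes
    the entropy of the quantized indices, i.e. retains the most information."""
    base_scale = np.abs(w).max() + 1e-12
    best_scale, best_h = base_scale, -1.0
    for tau in grid:
        h = entropy(quantize_block(w, tau * base_scale))
        if h > best_h:
            best_scale, best_h = tau * base_scale, h
    return best_scale, best_h

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    block = rng.normal(size=64)          # one quantization block of weights
    scale, h = calibrate_block(block)
    print(f"calibrated scale={scale:.4f}, entropy={h:.3f} bits")
```

The grid search per block is cheap relative to finetuning, which is consistent with the paper's observation that the accuracy gain comes at minimal extra cost.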
Comprehensive experiments show that IR-QLoRA significantly improves accuracy across the LLaMA and LLaMA2 families at 2-4 bit-widths. For example, 4-bit LLaMA-7B achieves a 1.4% improvement in MMLU accuracy over state-of-the-art methods. The performance gain comes with only minimal additional training time, demonstrating the efficiency of IR-QLoRA. The method is versatile and compatible with various quantization frameworks, providing general accuracy gains. The code is available at https://github.com/htqin/ir-qlora.