Accurate LoRA-Finetuning Quantization of LLMs via Information Retention


2024 | Haotong Qin*, Xudong Ma*, Xingyu Zheng, Xiaoyang Li, Yang Zhang, Shouda Liu, Jie Luo, Xianglong Liu, Michele Magno
This paper introduces IR-QLoRA, a novel method for accurate quantization of large language models (LLMs) via LoRA-finetuning with information retention. The method addresses the significant accuracy degradation of quantized LLMs, which is largely caused by information loss during quantization. IR-QLoRA consists of two main components: Information Calibration Quantization (ICQ) and Information Elastic Connection (IEC).

1. **Information Calibration Quantization (ICQ)**: ICQ ensures that the quantized parameters of the LLM retain as much of the original information as possible. It does so by maximizing the entropy of the quantized weights, which reduces information loss and improves accuracy. The process initializes a calibration constant and then optimizes it so that the entropy of the quantized weights is maximized (a sketch follows after this summary).

2. **Information Elastic Connection (IEC)**: IEC enhances the information recovery capability of LoRA by allowing it to utilize diverse information. It constructs parameter-free connections for LoRA, enabling more flexible and diverse information transformations, which helps LoRA better exploit the original information extracted by the quantized LLM projection (also sketched below).

The paper demonstrates the effectiveness of IR-QLoRA through extensive experiments on the LLaMA and LLaMA2 model families, showing significant accuracy improvements across bit-widths from 2 to 4 bits. The method remains efficient, adding only a minimal increase in training time, and it can be integrated with various quantization frameworks, making it a promising approach for deploying LLMs on resource-constrained hardware.
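To make the two ideas concrete, here is a minimal sketch of the ICQ concept: a calibration constant `tau` is grid-searched per weight block so that the entropy of the quantized values is maximized. The simple absmax quantizer, the search range, and the helper names (`quantize_block`, `entropy_bits`, `calibrate_tau`) are assumptions for illustration only, not the paper's exact quantizer or search procedure.

```python
import numpy as np

def quantize_block(w, tau, n_bits=4):
    """Hypothetical symmetric uniform quantizer: shift the block by the
    calibration constant tau, scale by absmax, round to integer levels."""
    levels = 2 ** n_bits
    shifted = w - tau
    scale = np.max(np.abs(shifted)) + 1e-8
    q = np.round(shifted / scale * (levels // 2 - 1))
    return np.clip(q, -(levels // 2), levels // 2 - 1).astype(np.int64), scale

def entropy_bits(q, n_bits=4):
    """Shannon entropy (in bits) of the discrete quantized values."""
    counts = np.bincount(q.ravel() + 2 ** (n_bits - 1), minlength=2 ** n_bits)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def calibrate_tau(w, n_bits=4, num_candidates=101):
    """Grid-search the calibration constant that maximizes the entropy of the
    quantized block; the candidate range here is an illustrative assumption."""
    best_tau, best_h = 0.0, -1.0
    for tau in np.linspace(-0.5, 0.5, num_candidates) * np.std(w):
        q, _ = quantize_block(w, tau, n_bits)
        h = entropy_bits(q, n_bits)
        if h > best_h:
            best_tau, best_h = tau, h
    return best_tau, best_h

# Example: calibrate one 64-element weight block.
block = np.random.randn(64).astype(np.float32)
tau_star, h_star = calibrate_tau(block)
```

And a minimal sketch of the IEC concept: parameter-free connections feed a group-averaged copy of the input into the LoRA down-projection and a tiled copy of the low-rank features into the up-projection, so LoRA can reuse information already carried by the quantized projection. The function name `iec_lora_forward`, the shapes, and the exact averaging/tiling scheme are assumptions; the paper's formulation may differ in detail.

```python
import torch

def iec_lora_forward(x, w_q, lora_a, lora_b):
    """Sketch of a LoRA forward pass with parameter-free elastic connections
    (IEC-style). Assumes the hidden size h is divisible by the LoRA rank r.

    x      : (batch, h) input activations
    w_q    : (h, o) dequantized base weight of the quantized LLM projection
    lora_a : (h, r) LoRA down-projection
    lora_b : (r, o) LoRA up-projection
    """
    h, o = w_q.shape
    r = lora_a.shape[1]
    base = x @ w_q  # projection through the quantized LLM weight

    # Elastic connection into the low-rank space: group-average the input so
    # its dimension matches the rank, then add it to the down-projection.
    grouped = x.view(x.shape[0], r, h // r).mean(dim=-1)   # (batch, r)
    low_rank = x @ lora_a + grouped

    # Elastic connection out of the low-rank space: tile the low-rank features
    # up to the output dimension and add them to the up-projection.
    tiled = low_rank.repeat(1, -(-o // r))[:, :o]           # (batch, o)
    return base + low_rank @ lora_b + tiled

# Example shapes: batch=2, hidden=4096, output=4096, rank=64.
x = torch.randn(2, 4096)
out = iec_lora_forward(x, torch.randn(4096, 4096),
                       torch.randn(4096, 64), torch.randn(64, 4096))
```

Both connections are parameter-free (only reshaping, averaging, and tiling), which matches the summary's claim that IEC adds representational flexibility without introducing extra trainable weights.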