2024 | Zhewei Yao, Xiaoxia Wu, Cheng Li, Stephen Youn, Yuxiong He
This paper investigates post-training quantization (PTQ) for large language models (LLMs), analyzing the impact of different quantization schemes, model families, and bit precisions. The study evaluates existing PTQ methods, including round-to-nearest (RTN), GPTQ, and ZeroQuant, on two model families (OPT and BLOOM) with sizes ranging from 125M to 176B parameters. Key findings include: (1) activation quantization is generally more sensitive than weight quantization, and smaller models often tolerate activation quantization better than larger ones; (2) current PTQ methods cannot recover the original model quality under INT4-weight or INT4-weight-and-INT8-activation quantization; (3) a new method, Low-Rank Compensation (LoRC), is proposed to enhance quality recovery with minimal increase in model size. LoRC applies low-rank matrix factorization to the quantization error matrix to improve performance. The study also highlights the effectiveness of fine-grained quantization (FGQ) in reducing quantization error, particularly for larger models. Results show that LoRC, combined with FGQ, can nearly recover the original model quality for INT4 quantization. The paper concludes that FGQ offers an acceptable trade-off between accuracy and model size, and that LoRC, integrated with PTQ and FGQ, plays a key role in recovering full model quality with minimal size increase.
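The fine-grained quantization idea described above can be illustrated with a minimal NumPy sketch of group-wise symmetric round-to-nearest quantization. This is not the paper's implementation; the function name, the symmetric scaling scheme, and the group size are illustrative assumptions. Smaller groups give each block of weights its own scale, which is what reduces quantization error relative to a single per-tensor scale.

```python
import numpy as np

def rtn_quantize_groupwise(w, bits=4, group_size=64):
    """Illustrative group-wise symmetric round-to-nearest (RTN) quantization.

    Each contiguous group of `group_size` weights gets its own scale,
    so outliers in one group do not inflate the step size elsewhere.
    Returns the dequantized (fake-quantized) weights for inspection.
    """
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for INT4
    w_groups = w.reshape(-1, group_size)
    # One scale per group, chosen so the group's max maps to qmax.
    scale = np.abs(w_groups).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero groups
    q = np.round(w_groups / scale)      # integer grid in [-qmax, qmax]
    return (q * scale).reshape(w.shape)
```

For a weight with per-group scale `s`, the rounding error per element is bounded by `s / 2`, so shrinking the group size can only tighten the per-element error bound.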
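The LoRC idea of factorizing the quantization error matrix can be sketched with a truncated SVD. This is a minimal illustration under the assumption that the error matrix E = W - Q(W) is approximated by a rank-k product U @ V; the function name and rank are hypothetical, not the paper's code.

```python
import numpy as np

def lorc_compensate(w, w_quantized, rank=8):
    """Illustrative LoRC-style compensation: approximate the quantization
    error E = W - Q(W) with a rank-`rank` factorization U @ V via SVD.

    The corrected weight is then W_hat = Q(W) + U @ V, costing only
    two thin matrices of extra storage.
    """
    error = w - w_quantized
    u, s, vt = np.linalg.svd(error, full_matrices=False)
    U = u[:, :rank] * s[:rank]          # shape (m, rank)
    V = vt[:rank, :]                    # shape (rank, n)
    return U, V
```

Because truncated SVD gives the best rank-k approximation in Frobenius norm, adding U @ V back can only reduce (never increase) the residual quantization error, while the extra storage grows linearly in the rank rather than with the full matrix size.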