This paper explores the effectiveness of post-training quantization (PTQ) in reducing memory consumption and computational costs for large language models (LLMs). The authors conduct a comprehensive analysis of various quantization schemes, including weight-only, activation-only, and combined weight-and-activation quantization, using methods such as round-to-nearest (RTN), GPTQ, ZeroQuant, and their variants. They apply these methods to two distinct model families, OPT and BLOOM, with parameter counts ranging from 125M to 176B. Key findings include:
1. **Sensitivity Analysis**: Activation quantization is generally more sensitive than weight quantization (it causes larger accuracy degradation), and smaller models often tolerate activation quantization better than larger ones.
2. **Evaluation of PTQ Methods**: Current PTQ methods cannot match the original model quality under INT4-weight or INT4-weight-and-INT8-activation quantization; noticeable accuracy degradation remains.
3. **Proposed Method (LoRC)**: Based on these insights, the authors propose Low Rank Compensation (LoRC), which uses low-rank matrices to approximate the quantization error and recover model quality with only a minimal increase in model size.
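To make the low-rank error-compensation idea concrete, here is a minimal NumPy sketch. The symmetric per-tensor round-to-nearest quantizer and the rank of 8 are illustrative assumptions, not the paper's exact recipe; the point is that the quantization error `W - W_q` is approximated by a small pair of factors stored alongside the quantized weights.

```python
import numpy as np

def rtn_quantize(w: np.ndarray, n_bits: int = 4) -> np.ndarray:
    """Symmetric round-to-nearest quantization with a single per-tensor scale."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale  # dequantized ("fake-quantized") weights

def low_rank_compensate(w: np.ndarray, w_q: np.ndarray, rank: int = 8):
    """Approximate the quantization error W - W_q with a rank-`rank` factorization."""
    err = w - w_q
    u, s, vt = np.linalg.svd(err, full_matrices=False)
    u_r = u[:, :rank] * s[:rank]   # absorb singular values into the left factor
    v_r = vt[:rank, :]
    return u_r, v_r                # small factors stored next to W_q

# Usage: the effective weight at inference is W_q + U_r @ V_r.
rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512)).astype(np.float32)
w_q = rtn_quantize(w, n_bits=4)
u_r, v_r = low_rank_compensate(w, w_q, rank=8)
print("error without compensation:", np.linalg.norm(w - w_q))
print("error with compensation:   ", np.linalg.norm(w - (w_q + u_r @ v_r)))
```

Because the factors have rank far below the matrix dimension, the extra storage is marginal compared with the quantized weights themselves, which is why the authors report quality recovery with only a minimal size increase.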
The paper also provides detailed evaluations of different quantization schemes, model families, and quantization bit precisions, offering a comprehensive understanding of the trade-offs between model size and quality. The authors recommend specific quantization strategies for different model sizes and types, emphasizing the importance of fine-grained quantization and the benefits of combining it with LoRC.
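Fine-grained (group-wise) quantization assigns a separate scale to each small block of weights instead of one scale per tensor or per row. The sketch below illustrates the effect with a simple NumPy round-to-nearest quantizer; the group size of 128 and the comparison against per-row scaling are illustrative choices, not necessarily the settings used in the paper.

```python
import numpy as np

def groupwise_rtn_quantize(w: np.ndarray, n_bits: int = 4, group_size: int = 128) -> np.ndarray:
    """Round-to-nearest quantization with one scale per contiguous group of weights."""
    qmax = 2 ** (n_bits - 1) - 1
    out_dim, in_dim = w.shape
    assert in_dim % group_size == 0, "illustrative sketch: pad the last group in practice"
    groups = w.reshape(out_dim, in_dim // group_size, group_size)
    scales = np.abs(groups).max(axis=-1, keepdims=True) / qmax
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax)
    return (q * scales).reshape(out_dim, in_dim)

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 1024)).astype(np.float32)
w_fine   = groupwise_rtn_quantize(w, n_bits=4, group_size=128)   # fine-grained groups
w_coarse = groupwise_rtn_quantize(w, n_bits=4, group_size=1024)  # one scale per row
print("fine-grained error:", np.linalg.norm(w - w_fine))
print("per-row error:     ", np.linalg.norm(w - w_coarse))
```

Smaller groups track local weight magnitudes more closely and therefore reduce quantization error, at the cost of storing more scales; combining this with a LoRC-style low-rank correction addresses the residual error that group scaling alone cannot remove.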