This paper explores the effectiveness of post-training quantization (PTQ) in reducing memory consumption and computational costs for large language models (LLMs). The authors conduct a comprehensive analysis of various quantization schemes, including weight-only, activation-only, and combined weight-and-activation quantization, using methods such as round-to-nearest (RTN), GPTQ, ZeroQuant, and their variants. They apply these methods to two distinct model families, OPT and BLOOM, with parameter counts ranging from 125M to 176B. Key findings include:
1. **Sensitivity Analysis**: Activation quantization is generally more sensitive than weight quantization (it causes larger accuracy degradation), and smaller models often tolerate activation quantization better than larger ones.
2. **Evaluation of PTQ Methods**: Current PTQ methods cannot match the original model quality under INT4-weight or INT4-weight-and-INT8-activation quantization; noticeable accuracy degradation remains.
3. **Proposed Method (LoRC)**: Based on these insights, the authors propose Low Rank Compensation (LoRC), which uses low-rank matrices to approximate the quantization error and recover model quality with only a minimal increase in model size.
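To make the low-rank error-compensation idea concrete, here is a minimal NumPy sketch. The symmetric per-tensor round-to-nearest quantizer and the rank of 8 are illustrative assumptions, not the paper's exact recipe; the point is that the quantization error `W - W_q` is approximated by a small pair of factors stored alongside the quantized weights.

```python
import numpy as np

def rtn_quantize(w: np.ndarray, n_bits: int = 4) -> np.ndarray:
    """Symmetric round-to-nearest quantization with a single per-tensor scale."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale  # dequantized ("fake-quantized") weights

def low_rank_compensate(w: np.ndarray, w_q: np.ndarray, rank: int = 8):
    """Approximate the quantization error W - W_q with a rank-`rank` factorization."""
    err = w - w_q
    u, s, vt = np.linalg.svd(err, full_matrices=False)
    u_r = u[:, :rank] * s[:rank]   # absorb singular values into the left factor
    v_r = vt[:rank, :]
    return u_r, v_r                # small factors stored next to W_q

# Usage: the effective weight at inference is W_q + U_r @ V_r.
rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512)).astype(np.float32)
w_q = rtn_quantize(w, n_bits=4)
u_r, v_r = low_rank_compensate(w, w_q, rank=8)
print("error without compensation:", np.linalg.norm(w - w_q))
print("error with compensation:   ", np.linalg.norm(w - (w_q + u_r @ v_r)))
```

Because the factors have rank far below the matrix dimension, the extra storage is marginal compared with the quantized weights themselves, which is why the authors report quality recovery with only a minimal size increase.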
The paper also provides detailed evaluations of different quantization schemes, model families, and quantization bit precisions, offering a comprehensive understanding of the trade-offs between model size and quality. The authors recommend specific quantization strategies for different model sizes and types, emphasizing the importance of fine-grained quantization and the benefits of combining it with LoRC.
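Fine-grained (group-wise) quantization assigns a separate scale to each small block of weights instead of one scale per tensor or per row. The sketch below illustrates the effect with a simple NumPy round-to-nearest quantizer; the group size of 128 and the comparison against per-row scaling are illustrative choices, not necessarily the settings used in the paper.

```python
import numpy as np

def groupwise_rtn_quantize(w: np.ndarray, n_bits: int = 4, group_size: int = 128) -> np.ndarray:
    """Round-to-nearest quantization with one scale per contiguous group of weights."""
    qmax = 2 ** (n_bits - 1) - 1
    out_dim, in_dim = w.shape
    assert in_dim % group_size == 0, "illustrative sketch: pad the last group in practice"
    groups = w.reshape(out_dim, in_dim // group_size, group_size)
    scales = np.abs(groups).max(axis=-1, keepdims=True) / qmax
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax)
    return (q * scales).reshape(out_dim, in_dim)

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 1024)).astype(np.float32)
w_fine   = groupwise_rtn_quantize(w, n_bits=4, group_size=128)   # fine-grained groups
w_coarse = groupwise_rtn_quantize(w, n_bits=4, group_size=1024)  # one scale per row
print("fine-grained error:", np.linalg.norm(w - w_fine))
print("per-row error:     ", np.linalg.norm(w - w_coarse))
```

Smaller groups track local weight magnitudes more closely and therefore reduce quantization error, at the cost of storing more scales; combining this with a LoRC-style low-rank correction addresses the residual error that group scaling alone cannot remove.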