6 Jun 2024 | Renren Jin, Jiangcun Du, Wuwei Huang, Wei Liu, Jian Luan, Bin Wang, Deyi Xiong
This paper presents a comprehensive evaluation of quantization strategies for large language models (LLMs), addressing the challenges of high computational and memory costs associated with large models. The study proposes a structured evaluation framework that assesses LLMs across three critical dimensions: knowledge & capacity, alignment, and efficiency. Extensive experiments are conducted on ten diverse benchmarks, including language understanding, generation, and alignment tasks. Key findings include:
1. **Performance Comparison**: LLMs quantized to 4 bits retain performance comparable to their non-quantized counterparts, and models with larger parameter scales generally outperform smaller ones (a minimal quantization sketch follows this list).
2. **Perplexity as a Metric**: Perplexity serves as a reliable proxy for quantized LLMs' performance on most benchmarks, correlating strongly with the number of quantization bits (see the perplexity sketch below).
3. **Efficiency and Engineering Challenges**: Quantization reduces memory consumption, but it does not automatically translate into faster inference; substantial engineering effort and hardware support are required to realize speed-ups.
4. **SpQR's Effectiveness**: SpQR, a quantization method that isolates outlier weights and keeps them at higher precision, quantizes LLMs to 2 bits more effectively than competing methods in terms of both performance and efficiency (see the outlier-isolation sketch below).
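To make the first finding concrete, here is a minimal sketch of group-wise symmetric 4-bit weight quantization in NumPy. It is illustrative only: the group size, the symmetric scaling, and the function names are assumptions, and real post-training quantization methods add calibration data and error compensation on top of this basic scheme.

```python
# Minimal sketch of 4-bit group-wise weight quantization (illustrative only;
# real post-training methods add calibration and error compensation).
import numpy as np

def quantize_4bit(weights: np.ndarray, group_size: int = 128):
    """Quantize a 1-D weight vector to signed 4-bit integers, one scale per group."""
    w = weights.reshape(-1, group_size)
    # Per-group scale so the max-magnitude weight maps to +/-7.
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_4bit(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, s = quantize_4bit(w)
print("max reconstruction error:", np.abs(dequantize_4bit(q, s) - w).max())
```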
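The perplexity finding can be checked with a few lines of Hugging Face code. This is a minimal sketch, not the paper's evaluation pipeline: `gpt2` stands in for whichever (quantized) checkpoint is being measured, and a real measurement would average the loss over a full held-out corpus rather than a single sentence.

```python
# Minimal sketch of measuring perplexity with Hugging Face transformers;
# "gpt2" is a placeholder, not a model evaluated in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

text = "Quantization trades numerical precision for memory savings."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels equal to input_ids, the model returns the mean
    # cross-entropy loss over the predicted tokens.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print("perplexity:", torch.exp(loss).item())
```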
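The sketch below illustrates the outlier-isolation idea attributed to SpQR: a small fraction of high-magnitude weights is stored separately at 16-bit precision while the remaining dense weights are quantized to 2 bits. The outlier fraction, the single per-tensor scale, and the function names are assumptions for illustration; the actual SpQR algorithm is considerably more elaborate (per-group scales and a compressed sparse outlier format).

```python
# Illustrative sketch of outlier isolation for very low-bit quantization.
# Thresholds and layout are assumptions, not the SpQR algorithm itself.
import numpy as np

def quantize_with_outliers(w: np.ndarray, bits: int = 2, outlier_frac: float = 0.01):
    levels = 2 ** (bits - 1) - 1                     # e.g. +/-1 for signed 2-bit
    k = max(1, int(outlier_frac * w.size))
    outlier_idx = np.argsort(np.abs(w))[-k:]         # largest-magnitude weights
    outliers = w[outlier_idx].astype(np.float16)     # kept separately, higher precision

    dense = w.copy()
    dense[outlier_idx] = 0.0                         # remove outliers before scaling
    scale = np.abs(dense).max() / max(levels, 1)
    q = np.clip(np.round(dense / scale), -levels - 1, levels).astype(np.int8)
    return q, scale, outlier_idx, outliers

def reconstruct(q, scale, outlier_idx, outliers):
    w_hat = q.astype(np.float32) * scale
    w_hat[outlier_idx] = outliers.astype(np.float32)
    return w_hat

w = np.random.randn(4096).astype(np.float32)
print("mean reconstruction error:",
      np.abs(reconstruct(*quantize_with_outliers(w)) - w).mean())
```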
The study highlights the trade-offs between model efficiency and performance degradation, emphasizing the need for further research to address practical deployment challenges in resource-limited settings.