A Comprehensive Evaluation of Quantization Strategies for Large Language Models

6 Jun 2024 | Renren Jin, Jiangcun Du, Wuwei Huang, Wei Liu, Jian Luan, Bin Wang, Deyi Xiong
A comprehensive evaluation of quantization strategies for large language models (LLMs) is presented, focusing on how instruction-tuned LLMs perform under various quantization methods. The study evaluates LLMs across three critical dimensions: knowledge & capacity, alignment, and efficiency, using ten diverse benchmarks. The results show that 4-bit quantization maintains performance comparable to non-quantized models, whereas lower-bit quantization leads to noticeable degradation, and that perplexity serves as a reliable proxy metric for quantized LLMs on most benchmarks. Quantized LLMs with larger parameter scales outperform smaller ones and are therefore preferable when memory is limited, although quantization can slow down inference. The study also finds that isolating outlier weights enables SpQR to quantize LLMs to 2 bits effectively, outperforming GPTQ at the same bit width. Despite the memory savings from quantization, substantial engineering effort and hardware support are still needed to balance decoding speed and memory consumption. The evaluation further notes that quantization can affect the truthfulness and social bias of LLMs, underscoring the importance of considering alignment as well as performance metrics when deploying quantized models and the need for further research in this area. The study concludes that quantization strategies must be carefully evaluated against the requirements of real-world applications.
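To make the ideas above concrete, the sketch below illustrates per-group round-to-nearest weight quantization with optional outlier isolation, in the spirit of the outlier-handling idea the summary attributes to SpQR. It is not the authors' implementation or the actual SpQR/GPTQ algorithm; the function name, the group size of 64, and the 3-sigma outlier threshold are illustrative assumptions.

```python
# Minimal sketch (assumed, simplified): per-group asymmetric round-to-nearest
# quantization of a weight vector, with optional outlier isolation where
# large-magnitude weights bypass quantization and stay in full precision.
import numpy as np


def quantize_dequantize(weights, bits=4, group_size=64, outlier_threshold=None):
    """Quantize a 1-D weight array group by group and return dequantized values.

    If `outlier_threshold` is set, weights deviating from the group mean by more
    than threshold * std are kept in full precision (a rough stand-in for the
    outlier-isolation idea described for SpQR).
    """
    levels = 2 ** bits - 1
    out = np.empty_like(weights, dtype=np.float32)
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size].astype(np.float32)
        mask = np.zeros(group.shape, dtype=bool)
        if outlier_threshold is not None:
            mask = np.abs(group - group.mean()) > outlier_threshold * (group.std() + 1e-8)
        inliers = group[~mask]
        if inliers.size == 0:
            out[start:start + group_size] = group  # degenerate group: leave as-is
            continue
        lo, hi = inliers.min(), inliers.max()
        scale = (hi - lo) / levels if hi > lo else 1.0
        # Asymmetric uniform quantization to `bits` bits, then dequantize.
        q = np.clip(np.round((group - lo) / scale), 0, levels)
        deq = q * scale + lo
        deq[mask] = group[mask]  # outliers bypass quantization entirely
        out[start:start + group_size] = deq
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=4096).astype(np.float32)
    w[::512] *= 20.0  # inject a few artificial outlier weights
    for bits in (4, 2):
        plain = quantize_dequantize(w, bits=bits)
        isolated = quantize_dequantize(w, bits=bits, outlier_threshold=3.0)
        print(f"{bits}-bit  RTN error: {np.abs(w - plain).mean():.4f}  "
              f"with outlier isolation: {np.abs(w - isolated).mean():.4f}")
```

In this toy setting, the reconstruction error gap between plain round-to-nearest and the outlier-isolating variant widens as the bit width drops, which loosely mirrors the summary's observation that outlier handling matters most at 2 bits.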