This paper presents SliM-LLM, a salience-driven mixed-precision quantization method for large language models (LLMs), designed to deliver strong performance with low-bit weights in a deployment-friendly manner. The method leverages the salience distribution of LLM weights to determine bit-widths and quantizer parameters for accurate quantization, while aligning the bit-width partition with quantization groups for compact memory usage and fast integer computation on hardware. SliM-LLM introduces two novel techniques: (1) Salience-Determined Bit Allocation (SBA), which allocates group-wise bit-widths based on the salience distribution of weights, minimizing the relative entropy between the outputs of the original and quantized models while preserving inference efficiency; and (2) Salience-Weighted Quantizer Calibration (SQC), which optimizes quantizer parameters with awareness of element-wise salience within each group, balancing the preservation of salient information against overall quantization error. Comprehensive experiments show that SliM-LLM significantly improves the accuracy of various LLMs at ultra-low bit-widths (2-3 bits), achieving substantial memory savings and lower perplexity than existing methods. SliM-LLM$^{+}$, which integrates gradient-based quantizers, further reduces perplexity. The method is efficient, requires no fine-tuning, and achieves high performance on GPUs. The code is available at https://github.com/Aaronhuang-778/SliM-LLM.
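To make the group-wise allocation concrete, below is a minimal Python sketch of salience-driven mixed-precision bit assignment under a fixed average bit-width. It is an illustration only, not the released implementation: the salience proxy (activation-weighted magnitude), the 1:1 pairing of low- and high-salience groups, and the function name `allocate_group_bitwidths` are all assumptions made here; in the paper, SBA instead solves an optimization that minimizes the output relative entropy.

```python
import numpy as np

def allocate_group_bitwidths(weight, act_norm, group_size=128, avg_bits=2):
    """Illustrative sketch (not the authors' implementation): groups with
    higher salience receive avg_bits + 1 bits, low-salience groups receive
    avg_bits - 1 bits, paired 1:1 so the average stays at avg_bits.

    weight:   (out_features, in_features) weight matrix
    act_norm: (in_features,) per-input-channel activation norms, used as a
              simple salience proxy (an assumption in this sketch)
    """
    assert weight.shape[1] % group_size == 0
    n_groups = weight.shape[1] // group_size
    salience = np.empty(n_groups)
    for g in range(n_groups):
        cols = slice(g * group_size, (g + 1) * group_size)
        # Salience proxy: activation-weighted magnitude of the group.
        salience[g] = np.sum(np.abs(weight[:, cols]) * act_norm[cols])
    bits = np.full(n_groups, avg_bits)
    order = np.argsort(salience)        # group indices, least salient first
    k = n_groups // 2                   # pair low/high groups 1:1
    bits[order[:k]] = avg_bits - 1      # least salient: fewer bits
    bits[order[n_groups - k:]] = avg_bits + 1  # most salient: more bits
    return bits

# Example: a 4096x4096 layer quantized to an average of 2 bits.
rng = np.random.default_rng(0)
w = rng.standard_normal((4096, 4096)).astype(np.float32)
a = np.abs(rng.standard_normal(4096)).astype(np.float32)
print(allocate_group_bitwidths(w, a))   # per-group bit-widths, mean == 2
```

Assigning bit-widths to whole quantization groups rather than to scattered individual elements is what keeps the resulting format compact in memory and amenable to fast integer kernels, as described in the abstract.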