BiLLM: Pushing the Limit of Post-Training Quantization for LLMs

15 May 2024 | Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, Xiaojuan Qi
**Abstract:** Pretrained large language models (LLMs) are highly effective at language processing but demand substantial memory and computational resources. Binarization, an aggressive compression technique, reduces model weights to 1 bit, dramatically lowering these demands; however, existing quantization techniques fail to maintain LLM performance at such ultra-low bit-widths. This paper introduces BiLLM, a 1-bit post-training quantization scheme tailored for LLMs. BiLLM identifies and structurally selects salient weights, minimizing their compression loss through a binary residual approximation, and it proposes an optimal splitting search to accurately binarize the remaining non-salient weights, which follow a bell-shaped distribution. BiLLM achieves high-accuracy inference (e.g., 8.41 perplexity on LLaMA-2-70B) with only 1.08-bit weights on average, outperforming state-of-the-art (SOTA) quantization methods by significant margins. Moreover, BiLLM binarizes a 7-billion-parameter model in under 0.5 hours on a single GPU, demonstrating efficient time consumption.

**Introduction:** LLMs such as OPT and LLaMA have gained significant attention for their strong performance in natural language processing, but their large parameter counts and computational requirements hinder deployment on memory-constrained devices. Model quantization, particularly post-training quantization (PTQ), reduces model size and GPU memory consumption without retraining. While advanced PTQ methods for LLMs perform well at 8-bit and 4-bit precision, they typically collapse at ultra-low bit-widths. BiLLM addresses this challenge by using the Hessian matrix to identify salient weights and applying a structured selection with binary residual approximation to them, while an optimal splitting binarization strategy minimizes the quantization error of the non-salient weights.

**Method:** BiLLM consists of two core components: structural selection and residual binarization of salient weights, and optimal splitting of non-salient weights. The Hessian matrix is used to assess weight sensitivity, guiding the structured selection of salient weights; a binary residual approximation is then applied to these weights to minimize quantization error. The non-salient weights, which follow a bell-shaped distribution, are split at a searched break point into a concentrated area and a sparse area, and each area is binarized separately. Minimal sketches of both steps are given below.
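The following is a minimal sketch of the salient-weight path, assuming a diagonal-Hessian sensitivity proxy (s_ij = w_ij^2 / [H^{-1}]_jj, borrowed from related PTQ work), column-wise structured selection, and a per-group scaling factor alpha = mean(|w|). Function names such as `select_salient_columns` are illustrative, not BiLLM's actual implementation, and the exact saliency metric and grouping in the paper differ in detail.

```python
import torch

def binarize(w: torch.Tensor):
    """1-bit L2-optimal approximation of a weight group: alpha = mean(|w|), B = sign(w)."""
    alpha = w.abs().mean()
    return alpha, torch.sign(w)

def residual_binarize(w: torch.Tensor) -> torch.Tensor:
    """Binary residual approximation for salient weights: a second binarization
    is fitted to the residual of the first, so w ~= alpha1*B1 + alpha2*B2."""
    a1, b1 = binarize(w)
    r = w - a1 * b1                      # residual left by the first pass
    a2, b2 = binarize(r)
    return a1 * b1 + a2 * b2

def hessian_saliency(W: torch.Tensor, h_inv_diag: torch.Tensor) -> torch.Tensor:
    """Assumed per-weight sensitivity proxy s_ij = w_ij^2 / [H^-1]_jj.
    W: [out_features, in_features], h_inv_diag: [in_features]."""
    return W.pow(2) / h_inv_diag.clamp_min(1e-8)

def select_salient_columns(W: torch.Tensor, h_inv_diag: torch.Tensor, n_cols: int) -> torch.Tensor:
    """Structured (column-wise) selection: rank input columns by total saliency."""
    col_scores = hessian_saliency(W, h_inv_diag).sum(dim=0)
    return torch.topk(col_scores, n_cols).indices
```

In a full pipeline, the selected salient columns would be quantized with `residual_binarize`, while the remaining columns are handed to the splitting step sketched next.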
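For the non-salient, bell-shaped weights, the sketch below splits the distribution at a break point p into a concentrated region (|w| <= p) and sparse tails (|w| > p), binarizes each region with its own scale, and chooses p by a simple grid search over the reconstruction error. The grid search is only a stand-in for BiLLM's optimal splitting search, and names such as `search_break_point` are hypothetical.

```python
import torch

def binarize_group(w: torch.Tensor) -> torch.Tensor:
    """1-bit approximation of a weight group: mean(|w|) * sign(w)."""
    if w.numel() == 0:
        return w
    return w.abs().mean() * torch.sign(w)

def split_binarize(w: torch.Tensor, p: float) -> torch.Tensor:
    """Binarize the concentrated region (|w| <= p) and the sparse tails
    (|w| > p) separately, each with its own scaling factor."""
    concentrated = w.abs() <= p
    out = torch.empty_like(w)
    out[concentrated] = binarize_group(w[concentrated])
    out[~concentrated] = binarize_group(w[~concentrated])
    return out

def search_break_point(w: torch.Tensor, n_grid: int = 64) -> float:
    """Pick the break point p* minimizing the L2 reconstruction error of the
    split binarization (grid search stands in for the optimal search)."""
    lo, hi = w.abs().min().item(), w.abs().max().item()
    best_p, best_err = lo, float("inf")
    for p in torch.linspace(lo, hi, n_grid).tolist():
        err = (w - split_binarize(w, p)).pow(2).sum().item()
        if err < best_err:
            best_p, best_err = p, err
    return best_p
```

Typical usage on a non-salient weight group `w_ns` would be `p = search_break_point(w_ns)` followed by `w_hat = split_binarize(w_ns, p)`.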
**Experiments:** BiLLM is evaluated on multiple LLM families, including OPT and LLaMA, across several datasets and evaluation metrics. The results show that BiLLM achieves state-of-the-art performance at 1-bit quantization, outperforming other methods by significant margins. The method is also time-efficient, completing binarization of a 7-billion-parameter model within 0.5 hours on a single GPU.

**Conclusion:** BiLLM is a post-training binarization framework for LLMs that combines a binary residual approximation of structurally selected salient weights with an optimal split binarization of the bell-shaped non-salient weights, reducing weights to an average of 1.08 bits while retaining competitive accuracy.