BiLLM: Pushing the Limit of Post-Training Quantization for LLMs

2024 | Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, Xiaojuan Qi
BiLLM is a novel post-training quantization method that pushes large language model (LLM) weights down to the 1-bit regime while retaining high accuracy. Exploiting the weight distribution of LLMs, it first identifies and structurally selects salient weights and minimizes their compression loss with a binary residual approximation strategy; for the remaining non-salient weights, an optimal splitting search groups them by magnitude so that each group can be binarized accurately. BiLLM achieves high-accuracy inference with an average of 1.08-bit weights across various LLM families and evaluation metrics, outperforming existing quantization methods for LLMs. It can also binarize a 7-billion-parameter LLM within 0.5 hours on a single GPU, demonstrating satisfactory time efficiency. The method is validated across multiple LLM families and instruction-tuned models, showing robust performance. By advancing the bit-width frontier of LLM quantization, BiLLM promises to facilitate the deployment of LLMs in edge scenarios and on resource-constrained devices.
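To make the two-track scheme concrete, below is a minimal NumPy sketch of BiLLM-style binarization as described in the summary above. The function names, the saliency proxy (plain weight magnitude rather than the paper's Hessian-based sensitivity metric), the per-element salient mask (the paper selects salient weights structurally, e.g. by column), and the threshold grid are all illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def binarize(w):
    """Optimal 1-bit approximation w ~= alpha * sign(w), where
    alpha = mean(|w|) minimizes the L2 error for a fixed sign pattern."""
    alpha = np.abs(w).mean() if w.size else 0.0
    return alpha * np.sign(w)

def residual_binarize(w):
    """Binary residual approximation for salient weights: binarize w,
    then binarize the leftover residual and add the two approximations."""
    b1 = binarize(w)
    b2 = binarize(w - b1)  # second binary pass on the residual
    return b1 + b2

def split_binarize(w, n_grid=64):
    """Optimal splitting search for non-salient weights: scan a grid of
    magnitude break-points p, split weights into a concentrated group
    (|w| < p) and a sparse group (|w| >= p), binarize each group with its
    own scale, and keep the split with the smallest reconstruction error."""
    mags = np.abs(w)
    best_err, best_out = np.inf, None
    for p in np.linspace(mags.min(), mags.max(), n_grid)[1:-1]:
        out = np.empty_like(w)
        lo, hi = mags < p, mags >= p
        out[lo] = binarize(w[lo])
        out[hi] = binarize(w[hi])
        err = np.sum((w - out) ** 2)
        if err < best_err:
            best_err, best_out = err, out
    return best_out if best_out is not None else binarize(w)

def billm_style_quantize(W, salient_frac=0.05):
    """Quantize one weight matrix: residual-binarize the few salient
    weights, split-binarize the rest. salient_frac is a hypothetical knob,
    not a value from the paper."""
    flat = W.ravel()
    k = max(1, int(salient_frac * flat.size))
    salient = np.zeros(flat.size, dtype=bool)
    salient[np.argsort(np.abs(flat))[-k:]] = True  # magnitude proxy for saliency
    out = np.empty_like(flat)
    out[salient] = residual_binarize(flat[salient])
    out[~salient] = split_binarize(flat[~salient])
    return out.reshape(W.shape)
```

Under this reading, the reported ~1.08-bit average follows from the budget breakdown: every weight costs 1 bit for its sign, the small salient subset costs a second bit for its residual pass, and the group membership and per-group scales add a small amortized overhead.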