DB-LLM: Accurate Dual-Binarization for Efficient LLMs

19 Feb 2024 | Hong Chen, Chengtao Lv, Liang Ding, Haotong Qin, Xianbin Zhou, Yifu Ding, Xuebo Liu, Min Zhang, Jinyang Guo, Xianglong Liu, Dacheng Tao
Large language models (LLMs) have significantly advanced natural language processing, but their high memory and computational costs hinder practical deployment. Quantization is a key technique for improving efficiency, yet ultra-low-bit quantization typically causes severe accuracy degradation. This paper introduces DB-LLM, a novel dual-binarization method for LLMs. At the micro level, Flexible Dual Binarization (FDB) splits 2-bit weights into two independent sets of binary weights, improving representational capacity while retaining the efficiency of binarized computation. At the macro level, Deviation-Aware Distillation (DAD) counteracts prediction distortion by focusing the distillation signal on ambiguous samples.

Comprehensive experiments show that DB-LLM outperforms existing state-of-the-art methods under 2-bit quantization, achieving lower perplexity in language generation and stronger results on zero-shot tasks, while reducing computational consumption by roughly 20% compared with the previous state of the art at the same bit-width. By combining the efficiency of binarization with the flexibility of 2-bit quantization, DB-LLM attains a flatter loss landscape and richer weight representation. The method is data-free, and its distillation-based training helps avoid overfitting. Together, these properties make DB-LLM a promising approach for efficient LLM deployment.
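To make the dual-binarization idea concrete, here is a minimal sketch of approximating a weight matrix with two scaled binary bases. This is a generic residual dual-binary decomposition written for illustration, not necessarily the exact FDB formulation from the paper (which learns a flexible splitting of the 2-bit representation); the function name and the per-tensor scales are assumptions made for this sketch.

```python
import numpy as np

def dual_binarize(w):
    """Approximate a weight matrix with two scaled binary matrices.

    Generic residual dual-binarization sketch (illustrative, not the
    paper's exact FDB): the first binary base captures the sign of the
    weights, the second binarizes the remaining residual.
    """
    b1 = np.sign(w)
    b1[b1 == 0] = 1.0                  # keep the base strictly binary
    alpha1 = np.abs(w).mean()          # per-tensor scale (the paper may use finer-grained scales)

    residual = w - alpha1 * b1
    b2 = np.sign(residual)
    b2[b2 == 0] = 1.0
    alpha2 = np.abs(residual).mean()

    w_hat = alpha1 * b1 + alpha2 * b2  # effective 2-bit representation
    return w_hat, (alpha1, b1), (alpha2, b2)

# Toy usage: reconstruction error of the dual-binary approximation
w = np.random.randn(128, 128).astype(np.float32)
w_hat, _, _ = dual_binarize(w)
print("relative error:", np.linalg.norm(w - w_hat) / np.linalg.norm(w))
```

Because both bases are binary, the matrix product can be reduced to sign operations and accumulations, which is where the claimed efficiency of binarized computation comes from.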
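For the macro-level component, the summary says DAD focuses the distillation loss on ambiguous samples. The sketch below assumes ambiguity is measured by the entropy of the teacher's predictive distribution and that the per-token KL term is reweighted by it; the exact weighting used in DB-LLM may differ, and the function names here are hypothetical.

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def deviation_aware_kd_loss(student_logits, teacher_logits):
    """Sketch of an ambiguity-weighted distillation loss.

    Assumption: per-token ambiguity is the entropy of the teacher
    distribution, used to reweight the KL divergence between teacher
    and student. This illustrates the idea, not the paper's exact DAD loss.
    """
    eps = 1e-9
    p_t = softmax(teacher_logits)                      # teacher distribution
    p_s = softmax(student_logits)                      # student distribution

    kl = (p_t * (np.log(p_t + eps) - np.log(p_s + eps))).sum(axis=-1)
    entropy = -(p_t * np.log(p_t + eps)).sum(axis=-1)  # teacher ambiguity
    weights = entropy / (entropy.mean() + eps)         # emphasize ambiguous tokens

    return (weights * kl).mean()

# Toy usage on random logits for a batch of 4 tokens over a 10-word vocabulary
t = np.random.randn(4, 10)
s = np.random.randn(4, 10)
print("DAD-style loss:", deviation_aware_kd_loss(s, t))
```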