DB-LLM: Accurate Dual-Binarization for Efficient LLMs

19 Feb 2024 | Hong Chen, Chengtao Lv, Liang Ding, Haotong Qin, Xibin Zhou, Yifu Ding, Xuebo Liu, Min Zhang, Jinyang Guo, Xianglong Liu, Dacheng Tao
The paper introduces DB-LLM, a method for efficient and accurate ultra-low-bit quantization of large language models (LLMs). To counter the severe accuracy drops that typically accompany ultra-low-bit quantization, the authors propose two key components: Flexible Dual Binarization (FDB) and Deviation-Aware Distillation (DAD).

1. **Flexible Dual Binarization (FDB)**: FDB enhances the representation capability of 2-bit quantized weights by splitting them into two independent sets of binaries with flexible scales. This retains the efficiency of bitwise operations and the high sparsity of ultra-low-bit weights while improving accuracy (a code sketch follows this summary).

2. **Deviation-Aware Distillation (DAD)**: DAD mitigates the prediction distortions of low-bit LLMs by focusing on ambiguous samples. Using the teacher-student entropy as an ambiguity indicator, it reweights the distillation loss to prioritize such samples, improving performance on exactly the cases where quantization hurts most (see the second sketch below).

Experiments across benchmark datasets and model families show that DB-LLM significantly outperforms existing state-of-the-art (SOTA) quantization methods, achieving lower perplexity and greater computational savings. Detailed ablation studies validate the contribution of each component.

The main contributions of the paper are:
- Introducing FDB to enhance representation capability and efficiency.
- Proposing DAD to mitigate prediction distortions in low-bit LLMs.
- Achieving superior accuracy and computational efficiency in ultra-low-bit quantization.

The paper concludes by discussing potential future directions, such as exploring full binarization and further improving the quantization of activations and scale values.
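The FDB summary above leaves the splitting rule implicit. The following is a minimal PyTorch sketch of the general "two scaled binaries" idea, assuming a residual-style split with learnable per-channel scales; `DualBinaryWeight`, `binarize`, and the initialization choices are illustrative assumptions, not the paper's exact FDB formulation.

```python
import torch
import torch.nn as nn


def binarize(x: torch.Tensor) -> torch.Tensor:
    """Map every element to {-1, +1} (avoids sign(0) == 0)."""
    return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))


class DualBinaryWeight(nn.Module):
    """Approximate a weight matrix as alpha1 * B1 + alpha2 * B2 with B_i in {-1, +1}.

    A generic residual-binarization sketch of the 'two independent binaries'
    idea behind FDB, not the paper's exact method; names are illustrative.
    """

    def __init__(self, weight: torch.Tensor):
        super().__init__()
        # First binary captures the sign pattern; its scale is the mean
        # magnitude per output channel.
        b1 = binarize(weight)
        a1 = weight.abs().mean(dim=1, keepdim=True)
        # Second binary fits the residual left by the first one.
        residual = weight - a1 * b1
        b2 = binarize(residual)
        a2 = residual.abs().mean(dim=1, keepdim=True)
        # Learnable scales let later fine-tuning (e.g. distillation)
        # refine the effective 2-bit representation.
        self.alpha1 = nn.Parameter(a1)
        self.alpha2 = nn.Parameter(a2)
        self.register_buffer("b1", b1)
        self.register_buffer("b2", b2)

    def forward(self) -> torch.Tensor:
        # Two 1-bit tensors plus two scales reconstruct an effectively
        # 2-bit weight, so matmuls can run on bitwise kernels.
        return self.alpha1 * self.b1 + self.alpha2 * self.b2


# Usage: approximate a random weight matrix with two scaled binaries.
w = torch.randn(256, 256)
w_2bit = DualBinaryWeight(w)()
```

Because each component is a 1-bit tensor with a per-channel scale, the reconstruction stays compatible with bitwise matrix-multiplication kernels, which is where the claimed computational savings come from.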
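The DAD reweighting can likewise be sketched as an entropy-weighted distillation loss. The function below is a hedged illustration that uses only the teacher's per-token entropy as the ambiguity signal; the paper's teacher-student measure and exact weighting may differ, and `deviation_aware_kd_loss` is an illustrative name.

```python
import torch
import torch.nn.functional as F


def deviation_aware_kd_loss(student_logits: torch.Tensor,
                            teacher_logits: torch.Tensor,
                            temperature: float = 1.0) -> torch.Tensor:
    """Entropy-reweighted KD loss (illustrative sketch, not the paper's exact DAD).

    Tokens where the teacher is more uncertain (higher entropy) receive a
    larger weight, steering the student toward the ambiguous cases that
    low-bit quantization distorts most.
    """
    t_prob = F.softmax(teacher_logits / temperature, dim=-1)
    s_logprob = F.log_softmax(student_logits / temperature, dim=-1)

    # Per-token KL divergence between teacher and student distributions.
    kl = F.kl_div(s_logprob, t_prob, reduction="none").sum(dim=-1)

    # Teacher entropy as the ambiguity indicator, normalized to [0, 1].
    entropy = -(t_prob * t_prob.clamp_min(1e-12).log()).sum(dim=-1)
    weights = entropy / entropy.max().clamp_min(1e-12)

    # Reweighted distillation loss; weights are detached so they only
    # rescale gradients rather than receive them.
    return (weights.detach() * kl).mean() * temperature ** 2
```

In a quantization-aware training loop, this loss would simply replace the uniform distillation term, leaving the rest of the optimization unchanged.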