BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation

16 Feb 2024 | Dayou Du, Yijia Zhang, Shijie Cao, Jiaqi Guo, Ting Cao, Xiaowen Chu, Ningyi Xu
BitDistiller is a novel framework that combines Quantization-Aware Training (QAT) with Knowledge Distillation (KD) to enhance the performance of Large Language Models (LLMs) at ultra-low precisions (sub-4-bit). The framework addresses the challenges of weight quantization, which reduces memory and computational demands while preserving model fidelity. BitDistiller introduces a tailored asymmetric quantization and clipping technique to maximize the fidelity of quantized weights, together with a Confidence-Aware Kullback-Leibler Divergence (CAKLD) objective for self-distillation, enabling faster convergence and superior model performance. Empirical evaluations demonstrate that BitDistiller significantly outperforms existing methods in both 3-bit and 2-bit configurations on a range of language understanding and complex reasoning benchmarks. Notably, BitDistiller is more cost-effective, requiring less training data and fewer training resources. The code for BitDistiller is available at <https://github.com/DDuDa/BitDistiller>.
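To make the two ingredients concrete, below is a minimal PyTorch sketch of (a) an asymmetric quantize-dequantize step with clipping and (b) a confidence-aware KL blend. It is illustrative only: the fixed `clip_ratio` stands in for the clipping values that BitDistiller actually searches for, and the exact CAKLD formula (a teacher-confidence-weighted mix of reverse and forward KL, with `gamma` estimated from the teacher's average token probability) is an assumption based on the paper's description, not a verbatim copy of the released code.

```python
import torch
import torch.nn.functional as F

def asym_quant_dequant(w: torch.Tensor, n_bits: int = 3, clip_ratio: float = 0.9):
    """Asymmetric quantize-dequantize with simple range clipping.

    clip_ratio shrinks the min/max range to suppress outliers; here it is a
    fixed constant for brevity, whereas BitDistiller searches for the clipping
    values that best preserve the weights.
    """
    w_min = w.min() * clip_ratio
    w_max = w.max() * clip_ratio
    qmax = 2 ** n_bits - 1
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    zero_point = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(w / scale) + zero_point, 0, qmax)
    return (q - zero_point) * scale  # dequantized weights used in the QAT forward pass


def cakld_loss(student_logits: torch.Tensor,
               teacher_logits: torch.Tensor,
               gamma: float) -> torch.Tensor:
    """Confidence-aware blend of reverse KL (mode-seeking) and forward KL
    (mean-seeking). gamma is assumed to be the teacher's average token
    confidence, estimated once over the training data."""
    log_p = F.log_softmax(teacher_logits, dim=-1)   # teacher log-probs
    log_q = F.log_softmax(student_logits, dim=-1)   # student log-probs
    p = log_p.exp()
    q = log_q.exp()
    reverse_kl = (q * (log_q - log_p)).sum(-1).mean()  # KL(student || teacher)
    forward_kl = (p * (log_p - log_q)).sum(-1).mean()  # KL(teacher || student)
    return gamma * reverse_kl + (1.0 - gamma) * forward_kl
```

In self-distillation, the full-precision model serves as the teacher and its quantized counterpart as the student, so only one pretrained checkpoint is needed; a high `gamma` (confident teacher) pushes the student toward mode-seeking behavior, while a low `gamma` favors covering the teacher's full distribution.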