OneBit: Towards Extremely Low-bit Large Language Models
2024-11-29 | Yuzhuang Xu, Xu Han, Zonghan Yang, Shuo Wang, Qingfu Zhu, Zhiyuan Liu, Weidong Liu, Wanxiang Che
This paper introduces OneBit, a 1-bit quantization framework for large language models (LLMs) that substantially reduces the storage and computational overhead of deployment. The framework combines two components: a novel Linear layer built on Sign-Value-Independent Decomposition (SVID), which represents each weight matrix with approximately 1-bit values, and an SVID-based parameter initialization derived from matrix decomposition, which speeds up and stabilizes convergence during quantization-aware training.

Evaluated on perplexity and zero-shot accuracy across OPT, LLaMA, and LLaMA2 models of various sizes, OneBit retains at least 81% of the non-quantized performance on LLaMA models while using only 1-bit weight matrices, a significant improvement over existing extreme low-bit quantization methods. Extensive experiments further confirm the framework's efficiency and robustness, showing a favorable trade-off between model size and performance that makes OneBit suitable for deployment on devices with limited resources.
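To make the two components concrete, here is a minimal PyTorch sketch, assuming the paper's decomposition W ≈ sign(W) ⊙ (a bᵀ) and a layer of the form y = ((x ⊙ g) W±1ᵀ) ⊙ h. The names `svid` and `OneBitLinear`, the SVD-based rank-1 step, and all shapes are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the OneBit ideas; names and details are illustrative.
import torch
import torch.nn as nn


def svid(w: torch.Tensor):
    """Sign-Value-Independent Decomposition (sketch): split W into a +-1 sign
    matrix and a rank-1 approximation of |W|, so W ~= sign(W) * (a @ b.T)."""
    w_sign = torch.sign(w)  # +-1 matrix, storable at 1 bit per weight
    # Rank-1 approximation of the magnitudes via SVD (the paper also
    # discusses NMF; SVD is used here for simplicity).
    u, s, vh = torch.linalg.svd(w.abs(), full_matrices=False)
    a = u[:, 0] * s[0].sqrt()   # (out_features,) value vector, kept in FP16
    b = vh[0, :] * s[0].sqrt()  # (in_features,)  value vector, kept in FP16
    return w_sign, a, b


class OneBitLinear(nn.Module):
    """1-bit linear layer sketch: y = ((x * g) @ W_sign.T) * h, where g and h
    are full-precision value vectors initialized from SVID."""

    def __init__(self, weight: torch.Tensor):
        super().__init__()
        w_sign, a, b = svid(weight)
        self.register_buffer("w_sign", w_sign)  # +-1 weight matrix
        self.g = nn.Parameter(b)  # scales the input,  shape (in_features,)
        self.h = nn.Parameter(a)  # scales the output, shape (out_features,)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to x @ (w_sign * outer(h, g)).T, but keeps the
        # matrix multiply purely on the +-1 weights.
        return (x * self.g) @ self.w_sign.T * self.h


# Quick check that SVID initialization roughly reproduces the FP layer.
w = torch.randn(64, 128)
layer = OneBitLinear(w)
x = torch.randn(4, 128)
print(torch.nn.functional.mse_loss(layer(x), x @ w.T))
```

The factored form matters in practice: only the two small value vectors stay in full precision, while the dominant matrix multiplication runs on ±1 weights, which is what drives the memory savings reported in the paper.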