The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits


27 Feb 2024 | Shuming Ma*, Hongyu Wang*, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, Furu Wei*
The paper introduces BitNet b1.58, a 1-bit Large Language Model (LLM) variant in which every parameter is ternary, taking values in {−1, 0, +1}. The model matches full-precision (FP16) LLMs in perplexity and end-task performance while being significantly more cost-effective in latency, memory, throughput, and energy consumption. BitNet b1.58 defines a new scaling law and training recipe for high-performance, cost-effective LLMs and enables a new computation paradigm, opening the door to hardware designed specifically for 1-bit LLMs.

BitNet b1.58 is based on the BitNet architecture, which replaces nn.Linear with BitLinear (sketched below). It is trained from scratch with 1.58-bit weights and 8-bit activations, using an absmean quantization function to constrain each weight to −1, 0, or +1. The architecture adopts LLaMA-alike components such as RMSNorm, SwiGLU, and rotary embeddings, and removes all biases, making it easy to integrate into popular open-source frameworks.

Experiments show that BitNet b1.58 outperforms full-precision LLMs in speed, memory usage, and energy efficiency while matching their perplexity and end-task performance from the 3B model size onward. For example, a 3.9B BitNet b1.58 model is 2.4 times faster and uses 3.32 times less memory than a 3B LLaMA LLM. At 70B, BitNet b1.58 is 4.1 times faster and uses substantially less memory than the corresponding LLaMA LLM.

The paper also reports that BitNet b1.58 is more energy-efficient, reducing arithmetic-operation energy for matrix multiplication by 71.4 times on 7nm chips. At 70B, it supports up to 11 times the batch size of a LLaMA LLM, yielding 8.9 times higher throughput.

The paper discusses potential future directions, including 1.58-bit LLMs for long-sequence inference, deployment on edge and mobile devices, and the design of new hardware optimized for 1-bit LLMs. Overall, the results show that 1.58-bit LLMs offer a Pareto improvement over state-of-the-art full-precision models, delivering comparable quality at significantly lower cost.
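To make the BitLinear idea concrete, below is a minimal PyTorch sketch of the absmean weight quantization and per-token 8-bit absmax activation quantization described above. The helper names, the straight-through-estimator details, and the dequantize-then-FP16-matmul forward pass are our own assumptions for illustration, not the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def absmean_weight_quant(w: torch.Tensor, eps: float = 1e-5):
    """Ternarize weights to {-1, 0, +1} with absmean scaling (assumed helper)."""
    gamma = w.abs().mean().clamp(min=eps)        # scale: mean absolute weight
    w_q = (w / gamma).round().clamp(-1, 1)       # round-and-clip to the ternary set
    return w_q, gamma


def absmax_activation_quant(x: torch.Tensor, bits: int = 8, eps: float = 1e-5):
    """Quantize activations per token to a signed integer range (8-bit by default)."""
    q = 2 ** (bits - 1) - 1                      # 127 for 8 bits
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=eps) / q
    x_q = (x / scale).round().clamp(-q, q)
    return x_q, scale


class BitLinear(nn.Linear):
    """Drop-in replacement for nn.Linear with ternary weights and 8-bit activations.

    Latent weights stay in full precision; a straight-through estimator lets
    gradients flow through the non-differentiable rounding during training.
    """

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_q, w_scale = absmean_weight_quant(self.weight)
        w_deq = w_q * w_scale                              # dequantized ternary weights
        w = self.weight + (w_deq - self.weight).detach()   # STE: quantized forward, FP backward

        x_q, x_scale = absmax_activation_quant(x)
        x_deq = x_q * x_scale                              # dequantized 8-bit activations
        x = x + (x_deq - x).detach()

        # BitNet b1.58 removes all biases, so construct the layer with bias=False.
        return F.linear(x, w, self.bias)
```

A LLaMA-style block could then swap nn.Linear for this layer, e.g. BitLinear(4096, 11008, bias=False), while keeping RMSNorm, SwiGLU, and rotary embeddings unchanged. Note that the latency and energy gains reported in the paper come from dedicated low-bit kernels; this sketch dequantizes and falls back to a floating-point matmul for clarity.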
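As a rough illustration of where the savings come from, the following back-of-the-envelope calculation shows why a ternary parameter corresponds to log2(3) ≈ 1.58 bits and how ternary weight storage compares with FP16 for a 70B model. These are illustrative numbers only, not the measured GPU memory or energy results reported in the paper.

```python
import math

# A ternary parameter in {-1, 0, +1} carries log2(3) bits of information,
# which is where the name "1.58-bit" comes from.
bits_per_weight = math.log2(3)
print(f"bits per ternary weight: {bits_per_weight:.2f}")   # ~1.58

# Hypothetical weight-storage estimate for a 70B-parameter model.
# Illustrative only: the paper measures actual GPU inference memory,
# which also includes activations and the KV cache.
params = 70e9
fp16_gb = params * 16 / 8 / 1e9
ternary_gb = params * bits_per_weight / 8 / 1e9
print(f"FP16 weights:    ~{fp16_gb:.0f} GB")
print(f"ternary weights: ~{ternary_gb:.1f} GB (~{fp16_gb / ternary_gb:.1f}x smaller)")
```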