The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits


27 Feb 2024 | Shuming Ma*, Hongyu Wang*, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, Furu Wei*
The paper introduces BitNet b1.58, a 1-bit Large Language Model (LLM) variant in which every parameter is ternary, taking values in {−1, 0, +1}. The model matches full-precision (FP16) LLMs in perplexity and end-task performance while being significantly more cost-effective in latency, memory, throughput, and energy consumption. BitNet b1.58 defines a new scaling law and training recipe for high-performance, cost-effective LLMs and enables a new computation paradigm, opening the door to hardware designed specifically for 1-bit LLMs.

BitNet b1.58 is based on the BitNet architecture, which replaces nn.Linear with BitLinear (sketched below). It is trained from scratch with 1.58-bit weights and 8-bit activations, using an absmean quantization function to constrain each weight to −1, 0, or +1. The architecture adopts LLaMA-alike components such as RMSNorm, SwiGLU, and rotary embeddings, and removes all biases, making it easy to integrate into popular open-source frameworks.

Experiments show that BitNet b1.58 outperforms full-precision LLMs in speed, memory usage, and energy efficiency while matching their perplexity and end-task performance from the 3B model size onward. For example, a 3.9B BitNet b1.58 model is 2.4 times faster and uses 3.32 times less memory than a 3B LLaMA LLM. At 70B, BitNet b1.58 is 4.1 times faster and uses substantially less memory than the corresponding LLaMA LLM.

The paper also reports that BitNet b1.58 is more energy-efficient, reducing arithmetic-operation energy for matrix multiplication by 71.4 times on 7nm chips. At 70B, it supports up to 11 times the batch size of a LLaMA LLM, yielding 8.9 times higher throughput.

The paper discusses potential future directions, including 1.58-bit LLMs for long-sequence inference, deployment on edge and mobile devices, and the design of new hardware optimized for 1-bit LLMs. Overall, the results show that 1.58-bit LLMs offer a Pareto improvement over state-of-the-art full-precision models, delivering comparable quality at significantly lower cost.
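To make the BitLinear idea concrete, below is a minimal PyTorch sketch of the absmean weight quantization and per-token 8-bit absmax activation quantization described above. The helper names, the straight-through-estimator details, and the dequantize-then-FP16-matmul forward pass are our own assumptions for illustration, not the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def absmean_weight_quant(w: torch.Tensor, eps: float = 1e-5):
    """Ternarize weights to {-1, 0, +1} with absmean scaling (assumed helper)."""
    gamma = w.abs().mean().clamp(min=eps)        # scale: mean absolute weight
    w_q = (w / gamma).round().clamp(-1, 1)       # round-and-clip to the ternary set
    return w_q, gamma


def absmax_activation_quant(x: torch.Tensor, bits: int = 8, eps: float = 1e-5):
    """Quantize activations per token to a signed integer range (8-bit by default)."""
    q = 2 ** (bits - 1) - 1                      # 127 for 8 bits
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=eps) / q
    x_q = (x / scale).round().clamp(-q, q)
    return x_q, scale


class BitLinear(nn.Linear):
    """Drop-in replacement for nn.Linear with ternary weights and 8-bit activations.

    Latent weights stay in full precision; a straight-through estimator lets
    gradients flow through the non-differentiable rounding during training.
    """

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_q, w_scale = absmean_weight_quant(self.weight)
        w_deq = w_q * w_scale                              # dequantized ternary weights
        w = self.weight + (w_deq - self.weight).detach()   # STE: quantized forward, FP backward

        x_q, x_scale = absmax_activation_quant(x)
        x_deq = x_q * x_scale                              # dequantized 8-bit activations
        x = x + (x_deq - x).detach()

        # BitNet b1.58 removes all biases, so construct the layer with bias=False.
        return F.linear(x, w, self.bias)
```

A LLaMA-style block could then swap nn.Linear for this layer, e.g. BitLinear(4096, 11008, bias=False), while keeping RMSNorm, SwiGLU, and rotary embeddings unchanged. Note that the latency and energy gains reported in the paper come from dedicated low-bit kernels; this sketch dequantizes and falls back to a floating-point matmul for clarity.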
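As a rough illustration of where the savings come from, the following back-of-the-envelope calculation shows why a ternary parameter corresponds to log2(3) ≈ 1.58 bits and how ternary weight storage compares with FP16 for a 70B model. These are illustrative numbers only, not the measured GPU memory or energy results reported in the paper.

```python
import math

# A ternary parameter in {-1, 0, +1} carries log2(3) bits of information,
# which is where the name "1.58-bit" comes from.
bits_per_weight = math.log2(3)
print(f"bits per ternary weight: {bits_per_weight:.2f}")   # ~1.58

# Hypothetical weight-storage estimate for a 70B-parameter model.
# Illustrative only: the paper measures actual GPU inference memory,
# which also includes activations and the KV cache.
params = 70e9
fp16_gb = params * 16 / 8 / 1e9
ternary_gb = params * bits_per_weight / 8 / 1e9
print(f"FP16 weights:    ~{fp16_gb:.0f} GB")
print(f"ternary weights: ~{ternary_gb:.1f} GB (~{fp16_gb / ternary_gb:.1f}x smaller)")
```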