**Abstract:**
Post-training quantization (PTQ) techniques significantly reduce the memory usage, latency, and power consumption of Large Language Models (LLMs), but they can introduce quantization errors, especially in the presence of outliers. Recent studies suggest that rotating activation or weight matrices helps remove outliers and improves quantization. This paper identifies the rotation parameterizations that leave the full-precision output unchanged and finds that random rotations vary widely in quantization performance, with differences of up to 13 points on downstream zero-shot reasoning tasks. SpinQuant optimizes the rotation matrices with Cayley optimization on a small validation set, achieving 4-bit quantization of weights, activations, and the KV-cache with an accuracy gap to full precision of only 2.9 points on the LLaMA-2 7B model, outperforming existing methods by significant margins.
**Introduction:**
LLMs have demonstrated impressive performance across various applications, but their inference cost remains a challenge. PTQ techniques that quantize weights and activations reduce memory usage and latency. However, outliers stretch the quantization range, leaving fewer effective bits for the majority of values. Random rotations have been shown to reduce outliers and make tensors easier to quantize. SpinQuant goes further and optimizes the rotation matrices to minimize the final loss of the quantized network, improving accuracy and narrowing the gap to the full-precision model.
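To make the outlier argument concrete, the toy sketch below (PyTorch, not from the paper) fake-quantizes an activation matrix that has one exaggerated outlier channel, with and without a random rotation. Multiplying the activations by a random orthogonal matrix R and the weights by R^T leaves the full-precision product unchanged but spreads the outlier energy across channels, which typically shrinks the quantization range and the resulting error. The tensor shapes, the 4-bit per-tensor quantizer, and the 50x outlier are illustrative assumptions.

```python
import torch

torch.manual_seed(0)

# Toy activations with one exaggerated outlier channel (illustrative).
X = torch.randn(8, 64)
X[:, 3] *= 50.0                       # the outlier stretches the quantization range
W = torch.randn(64, 64)               # toy weight matrix
y_ref = X @ W                         # full-precision reference output

# Random orthogonal rotation R (Q factor of a QR decomposition).
R, _ = torch.linalg.qr(torch.randn(64, 64))

def fake_quant(t, bits=4):
    """Symmetric per-tensor round-to-nearest fake quantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = t.abs().max() / qmax
    return torch.round(t / scale).clamp(-qmax - 1, qmax) * scale

# Rotation invariance: (X R)(R^T W) equals X W in full precision.
print(torch.allclose((X @ R) @ (R.T @ W), y_ref, atol=1e-3))

# Activation quantization error, with and without the rotation.
err_plain = (fake_quant(X) @ W - y_ref).norm()
err_rot   = (fake_quant(X @ R) @ (R.T @ W) - y_ref).norm()
print(err_plain.item(), err_rot.item())  # rotated error is typically much smaller
```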
**Method:**
SpinQuant introduces rotation parameterizations for popular LLM architectures, covering both the residual stream and the attention blocks. The rotations preserve the network's full-precision output (numerical invariance) while reshaping intermediate activations and weights so that they become more quantization-friendly. The rotation matrices are constrained to be orthonormal and are optimized with Cayley SGD, which keeps them on the manifold of orthogonal matrices throughout training.
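Below is a minimal sketch of one Cayley-transform update on an orthonormal rotation matrix, in the spirit of the Cayley SGD the method relies on; the function name, step size, and use of a plain Euclidean gradient are illustrative assumptions rather than the paper's exact implementation. The key property is that the Cayley transform of a skew-symmetric matrix is orthogonal, so the updated R stays orthonormal and the full-precision output remains invariant.

```python
import torch

def cayley_step(R, grad, lr=1.0):
    """One Cayley-transform update that keeps R orthonormal (illustrative sketch).

    R    : (n, n) orthonormal rotation matrix
    grad : dL/dR, Euclidean gradient of the quantized-network loss w.r.t. R
    lr   : step size (illustrative value)
    """
    n = R.shape[0]
    # Project the gradient onto the tangent space via the skew-symmetric
    # matrix A = G R^T - R G^T (A^T = -A by construction).
    A = grad @ R.T - R @ grad.T
    I = torch.eye(n, dtype=R.dtype, device=R.device)
    # Cayley transform: (I + lr/2 * A)^{-1} (I - lr/2 * A) is orthogonal
    # whenever A is skew-symmetric, so the product with R stays orthonormal.
    return torch.linalg.solve(I + 0.5 * lr * A, I - 0.5 * lr * A) @ R
```

In practice, `grad` would come from backpropagating the quantized network's loss on the small validation set mentioned above, and the step would be applied iteratively until the rotations converge.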
**Experiments:**
Experiments on LLaMA-2 and LLaMA-3 models show that SpinQuant significantly improves accuracy in 4-bit quantization scenarios, outperforming state-of-the-art methods. The method is also compatible with advanced quantization techniques like GPTQ.
**Conclusion:**
SpinQuant effectively bridges the performance gap between full precision and 4-bit quantization, leveraging rotation invariance to reduce outliers and optimize quantization. The method is compatible with advanced quantization techniques and demonstrates state-of-the-art performance.