SpinQuant is a quantization technique that improves the accuracy of large language models (LLMs) by using learned rotations to reduce quantization error. The method applies rotation matrices to weights, activations, and the KV-cache to mitigate the impact of outliers, which are known to degrade quantization performance. By optimizing these rotation matrices with Cayley optimization, SpinQuant substantially narrows the accuracy gap between quantized and full-precision models. For example, on LLaMA-2 7B, SpinQuant reduces the accuracy gap to just 2.9 points, outperforming LLM-QAT by 19.1 points and SmoothQuant by 25.0 points. SpinQuant also outperforms QuaRot, a competing method that uses random rotations to remove outliers, shrinking the gap to full precision by a further 30.2% on LLaMA-2 7B and 34.1% on LLaMA-3 8B. The method is compatible with advanced quantization techniques such as GPTQ and achieves state-of-the-art performance on zero-shot reasoning tasks.

SpinQuant inserts rotation matrices into different parts of the LLM architecture, including the residual stream, attention blocks, and feed-forward layers, improving quantization accuracy while maintaining numerical invariance: because the rotations are orthogonal, they can be folded into adjacent weight matrices without changing the network's full-precision output. The rotations are optimized with Cayley SGD, which enables efficient optimization on the Stiefel manifold of orthogonal matrices. Across LLaMA-2 7B, 13B, and 70B and LLaMA-3 8B and 70B, SpinQuant consistently improves quantization accuracy, making it effective for both on-device and server-side inference.
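To make the mechanics concrete, below is a minimal PyTorch sketch of the two ingredients described above: folding an orthogonal rotation into activations and weights (numerical invariance), and a Cayley-transform update that keeps the rotation orthogonal while it is tuned against a quantization objective. This is an illustration under simplifying assumptions, not the authors' implementation: the helper names (fake_quant, fake_quant_ste, cayley_sgd_step), the per-tensor 4-bit quantizer, the layer-wise reconstruction loss, and the hyperparameters are stand-ins for SpinQuant's end-to-end objective.

```python
# Illustrative sketch only (not the SpinQuant codebase). It shows:
# (1) an orthogonal rotation folded into activations/weights leaves the
#     full-precision output unchanged,
# (2) rotations spread outliers before per-tensor quantization,
# (3) a Cayley-transform step that keeps R on the Stiefel manifold.
import torch

def fake_quant(x, n_bits=4):
    """Symmetric per-tensor round-to-nearest fake quantization."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().max() / qmax
    return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale

def fake_quant_ste(x, n_bits=4):
    """Fake quantization with a straight-through estimator so gradients reach R."""
    return x + (fake_quant(x, n_bits) - x).detach()

def cayley_sgd_step(R, grad, lr=0.5, eps=1e-8):
    """One Cayley-transform update: build a skew-symmetric direction from the
    gradient and retract, so R stays orthogonal. The step-size cap is a crude
    safeguard in the spirit of Cayley SGD."""
    A = grad @ R.T - R @ grad.T                      # skew-symmetric direction
    alpha = min(lr, 1.0 / (A.norm().item() + eps))   # step-size control
    I = torch.eye(R.shape[0])
    # R_new = (I + alpha/2 * A)^{-1} (I - alpha/2 * A) R
    return torch.linalg.solve(I + 0.5 * alpha * A, (I - 0.5 * alpha * A) @ R)

torch.manual_seed(0)
d = 64
X = torch.randn(256, d)
X[:, 3] *= 30.0                                      # inject an outlier channel
W = torch.randn(d, d) / d ** 0.5
target = X @ W

# Start from a random orthogonal rotation (a QuaRot-style baseline).
R, _ = torch.linalg.qr(torch.randn(d, d))

# Numerical invariance: (X R)(R^T W) equals X W up to floating-point error,
# so the rotation can be merged into the weights at no extra inference cost.
assert torch.allclose((X @ R) @ (R.T @ W), target, atol=1e-3)

err_plain = (fake_quant(X) @ fake_quant(W) - target).pow(2).mean()
err_rand = (fake_quant(X @ R) @ fake_quant(R.T @ W) - target).pow(2).mean()

# Refine the rotation with Cayley-SGD steps on a quantized-reconstruction proxy.
R = R.clone().requires_grad_(True)
for _ in range(50):
    out = fake_quant_ste(X @ R) @ fake_quant_ste(R.T @ W)
    loss = (out - target).pow(2).mean()
    loss.backward()
    with torch.no_grad():
        R.copy_(cayley_sgd_step(R, R.grad))
        R.grad.zero_()

R = R.detach()
err_learned = (fake_quant(X @ R) @ fake_quant(R.T @ W) - target).pow(2).mean()
print(f"4-bit MSE  no rotation: {err_plain:.4f}  "
      f"random rotation: {err_rand:.4f}  learned rotation: {err_learned:.4f}")
```

The design point the sketch illustrates is that the rotation is free at inference time: once learned, R is folded into the adjacent weight matrices, so the quantized model has the same structure and compute cost as the unrotated one, while the outlier channel that would otherwise inflate the per-tensor quantization scale has been spread across all channels.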