QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs
2024-03-30 | Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Martin Jaggi, Dan Alistarh, Torsten Hoefler, James Hensman
QuaRot is a 4-bit quantization scheme for large language models (LLMs) that removes outliers from hidden states, activations, and KV caches, enabling end-to-end 4-bit inference. The method applies randomized Hadamard transformations to rotate the model's hidden states, which makes quantization easier without changing the model's output. This computational invariance is applied to the residual hidden state, the feed-forward activations, parts of the attention mechanism, and the KV cache, so that every matrix multiplication runs in 4 bits without retaining any channels in higher precision.

On the LLAMA2-7B model, QuaRot achieves up to 2.16x speedup in the prefill phase and 3.39x memory savings during decoding, with a WikiText-2 perplexity loss of at most 0.63 while preserving 99% of the original model's zero-shot performance. QuaRot also supports 6- and 8-bit quantization, which is lossless. By combining computational invariance with incoherence processing, the method reduces the impact of outliers and enables effective quantization of weights, activations, and KV caches, significantly improving the efficiency of LLM inference while maintaining accuracy.
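The sketch below (not the official QuaRot code) illustrates the computational-invariance idea behind the method under simplifying assumptions: a randomized Hadamard matrix Q is orthogonal, so folding Q into the weights and rotating the activations leaves the matrix product unchanged while spreading outlier mass across channels. The helper name `random_hadamard` and the dimensions are illustrative only.

```python
# Minimal sketch of QuaRot-style rotation via a randomized Hadamard transform.
# Assumes NumPy and SciPy; hidden size n must be a power of two for scipy's hadamard().
import numpy as np
from scipy.linalg import hadamard

def random_hadamard(n, seed=0):
    """Orthogonal randomized Hadamard matrix Q = H @ diag(s) / sqrt(n), with s in {+1, -1}."""
    rng = np.random.default_rng(seed)
    s = rng.choice([-1.0, 1.0], size=n)
    return hadamard(n).astype(np.float64) * s / np.sqrt(n)

n = 256
rng = np.random.default_rng(1)
x = rng.normal(size=n)
x[::37] *= 50.0                      # inject a few activation outliers
W = rng.normal(size=(n, n))          # a stand-in weight matrix

Q = random_hadamard(n)
x_rot = Q.T @ x                      # rotate the activation
W_rot = W @ Q                        # fold the rotation into the weight

# Output is unchanged: (W Q)(Q^T x) = W x, because Q is orthogonal.
assert np.allclose(W @ x, W_rot @ x_rot)

# The rotation spreads outlier energy across channels, shrinking the dynamic
# range a 4-bit quantizer has to cover.
print("max/rms before:", np.abs(x).max() / np.sqrt(np.mean(x**2)))
print("max/rms after: ", np.abs(x_rot).max() / np.sqrt(np.mean(x_rot**2)))
```

Because the rotated weights W @ Q can be computed offline and the activation rotation fused into the preceding operation, the invariance comes essentially for free at inference time; quantization is then applied to the rotated, outlier-free tensors.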