30 Mar 2024 | Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Martin Jaggi, Dan Alistarh, Torsten Hoefler, James Hensman
QuaRot is a novel quantization scheme designed to quantize large language models (LLMs) end-to-end to 4 bits, including all weights, activations, and KV cache. The scheme leverages computational invariance and randomized Hadamard transformations to remove outliers from the hidden state without altering the model's output. This approach simplifies the quantization of activations and enables the quantization of the hidden state, attention mechanism, and KV cache. The result is a 4-bit quantized model where all matrix multiplications are performed in 4 bits, with minimal accuracy loss. QuaRot achieves up to 2.16x speedup during the prefill phase and 3.39x memory savings during the decoding phase on the Llama2-70B model, while maintaining 99% of the zero-shot task performance. The method is evaluated on language generation and zero-shot tasks, demonstrating superior performance compared to existing quantization techniques.
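The core idea can be illustrated with a small sketch. Below is a minimal, self-contained example of the computational-invariance trick: a randomized Hadamard rotation Q (with Q Qᵀ = I) is applied to a hidden state and simultaneously fused into the following layer's weights, so the layer output is unchanged while outliers in the rotated activations are spread out and become much easier to quantize to 4 bits. The helper names (`randomized_hadamard`, `quantize_int4`) and the toy per-tensor INT4 quantizer are illustrative assumptions, not QuaRot's actual implementation or API.

```python
# Sketch: fuse a randomized Hadamard rotation into a linear layer so the
# output is unchanged, while the rotated activations quantize better.
import numpy as np
from scipy.linalg import hadamard

def randomized_hadamard(n: int, seed: int = 0) -> np.ndarray:
    """Return an orthogonal n x n matrix Q = H @ D / sqrt(n), n a power of 2."""
    rng = np.random.default_rng(seed)
    H = hadamard(n).astype(np.float64)            # entries +-1, H @ H.T = n * I
    D = np.diag(rng.choice([-1.0, 1.0], size=n))  # random sign flips
    return (H @ D) / np.sqrt(n)

def quantize_int4(x: np.ndarray) -> np.ndarray:
    """Toy symmetric per-tensor INT4 quantize/dequantize (illustration only)."""
    scale = np.abs(x).max() / 7.0
    return np.clip(np.round(x / scale), -8, 7) * scale

n = 256
rng = np.random.default_rng(1)
x = rng.normal(size=(8, n))
x[:, 3] *= 50.0                                   # inject an outlier channel
W = rng.normal(size=(n, n)) / np.sqrt(n)          # a linear layer: y = x @ W.T

Q = randomized_hadamard(n)
x_rot, W_rot = x @ Q, W @ Q                       # rotate activations, fuse Q into W

# Computational invariance: Q cancels exactly (up to float error), y is unchanged.
assert np.allclose(x @ W.T, x_rot @ W_rot.T, atol=1e-8)

# The rotation spreads the outlier across channels, shrinking quantization error.
err_plain = np.abs(quantize_int4(x) @ W.T - x @ W.T).mean()
err_rot   = np.abs(quantize_int4(x_rot) @ W_rot.T - x @ W.T).mean()
print(f"mean |error| without rotation: {err_plain:.4f}")
print(f"mean |error| with rotation:    {err_rot:.4f}")
```

Because the rotation is folded into existing weight matrices offline, it adds no extra matrix multiplications at inference time; only the online Hadamard transforms QuaRot applies around attention and the down-projection incur (cheap, O(n log n)) runtime cost.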