GPTVQ: The Blessing of Dimensionality for LLM Quantization


23 Feb 2024 | Mart van Baalen, Andrey Kuzmin, Markus Nagel, Peter Couperus, Cedric Bastoul, Eric Mahurin, Tijmen Blankevoort, Paul Whatmough
GPTVQ is a novel method for post-training vector quantization (VQ) of large language models (LLMs) that significantly improves the size versus accuracy trade-off. The method interleaves quantization of one or more columns with updates to the remaining unquantized weights, using information from the Hessian of the per-layer output reconstruction MSE. Quantization codebooks are initialized with an efficient data-aware version of the EM algorithm, then updated and further compressed using integer quantization and SVD-based compression. GPTVQ achieves state-of-the-art results on a wide range of LLMs, including Llama-v2 and Mistral, and is efficient: processing a Llama-v2-70B model takes between 3 and 11 hours on a single H100, making the method accurate, fast, and scalable to very large LLMs. On mobile CPUs, VQ also improves latency compared to a 4-bit integer format.

The results show that non-uniform quantization with GPTVQ generally outperforms uniform PTQ methods, with the largest gains at low bitwidths and for very large models. By reducing both data-transfer latency and model footprint, GPTVQ makes VQ a feasible alternative to uniform quantization for compression.
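To make the procedure concrete, the sketch below illustrates the core loop on a single linear layer: weights are grouped into d-dimensional vectors, a shared codebook is initialized (here with plain k-means as a simplified stand-in for the paper's data-aware, Hessian-weighted EM initialization), and each quantized column's error is propagated to the not-yet-quantized columns through the inverse Hessian, in the spirit of GPTQ-style error feedback. This is an illustrative approximation, not the authors' implementation; the function names are hypothetical, and codebook updates, integer codebook quantization, and SVD-based codebook compression are omitted.

```python
import numpy as np


def kmeans_codebook(vectors, k, iters=20, seed=0):
    """Plain k-means codebook init; the paper uses a data-aware EM variant
    instead (this is a simplified stand-in)."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for c in range(k):
            members = vectors[assign == c]
            if len(members):
                centroids[c] = members.mean(0)
    return centroids


def vq_quantize_layer(W, H, d=2, k=256, damp=1e-2):
    """Vector-quantize W (rows x cols) column by column.

    Each column is split into d-dimensional vectors of consecutive weights,
    snapped to the nearest codeword, and the resulting quantization error is
    spread over the remaining columns through the upper Cholesky factor of
    the damped inverse Hessian (GPTQ-style error feedback).
    """
    rows, cols = W.shape
    assert rows % d == 0, "rows must be divisible by the VQ dimensionality"
    Hd = H + damp * np.mean(np.diag(H)) * np.eye(cols)
    U = np.linalg.cholesky(np.linalg.inv(Hd)).T      # upper Cholesky of H^-1
    codebook = kmeans_codebook(W.T.reshape(-1, d), k)
    Wq = W.copy()
    for j in range(cols):
        vecs = Wq[:, j].reshape(-1, d)
        dists = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        q = codebook[dists.argmin(1)].reshape(-1)    # nearest codewords
        err = (Wq[:, j] - q) / U[j, j]
        Wq[:, j] = q
        if j + 1 < cols:
            # Update not-yet-quantized columns to absorb the error.
            Wq[:, j + 1:] -= np.outer(err, U[j, j + 1:])
    return Wq, codebook


# Toy usage with random data; in practice H comes from calibration activations.
rng = np.random.default_rng(0)
W = rng.standard_normal((128, 64))
X = rng.standard_normal((64, 512))                   # calibration inputs
H = X @ X.T / X.shape[1]
Wq, codebook = vq_quantize_layer(W, H, d=2, k=64)
```

A full GPTVQ pipeline would additionally update the codebook after assignment and further compress it via integer quantization and SVD, as summarized above.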