23 Feb 2024 | Mart van Baalen, Andrey Kuzmin, Markus Nagel, Peter Couperus, Cedric Bastoul, Eric Mahurin, Tijmen Blankevoort, Paul Whatmough
This paper introduces GPTVQ, a novel method for post-training vector quantization (VQ) that significantly improves the size versus accuracy trade-off in large language models (LLMs). GPTVQ interleaves the quantization of one or more columns with updates to the remaining unquantized weights, using information from the Hessian of the per-layer output reconstruction MSE. The method initializes codebooks using an efficient data-aware version of the EM algorithm and then updates and compresses them using integer quantization and SVD-based compression. GPTVQ achieves state-of-the-art performance on a wide range of LLMs, such as Llama-v2 and Mistral, while being efficient, with processing times ranging from 3 to 11 hours for a 70B parameter model on a single H100 GPU. Additionally, GPTVQ demonstrates improved latency compared to 4-bit integer format on mobile CPUs. The source code is available at <https://github.com/qualcomm-ai-research/gptvq>.
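The abstract's core idea, fitting a small codebook of centroids to groups of weights via an EM-style procedure and replacing each weight vector with its nearest centroid, can be illustrated with a minimal k-means sketch. This is a simplified, data-unaware version (the paper's actual initialization is data-aware and Hessian-weighted); the function names and parameters below are illustrative, not from the GPTVQ source.

```python
import numpy as np

def init_codebook(weights, k=4, d=2, iters=10, seed=0):
    """Fit a VQ codebook of k centroids over d-dimensional weight groups
    with plain k-means (an EM-style alternation of assign and re-estimate)."""
    rng = np.random.default_rng(seed)
    vecs = weights.reshape(-1, d)  # group the flat weights into d-dim vectors
    # Initialize centroids from k distinct data points.
    codebook = vecs[rng.choice(len(vecs), size=k, replace=False)].copy()
    assign = np.zeros(len(vecs), dtype=int)
    for _ in range(iters):
        # E-step: assign every weight vector to its nearest centroid.
        dists = np.linalg.norm(vecs[:, None, :] - codebook[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # M-step: move each centroid to the mean of its assigned vectors.
        for j in range(k):
            members = vecs[assign == j]
            if len(members) > 0:
                codebook[j] = members.mean(axis=0)
    return codebook, assign

def quantize(weights, codebook, assign, d=2):
    """Replace each weight vector with its codebook entry; only the small
    codebook plus per-vector indices need to be stored."""
    return codebook[assign].reshape(weights.shape)
```

With `k=4` centroids over 2-dimensional groups, each pair of weights is stored as a 2-bit index, which is the source of VQ's favorable size versus accuracy trade-off relative to scalar quantization at the same bit width.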