8 Jun 2024 | Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh
This paper introduces AQLM, a novel additive quantization method for compressing large language models (LLMs) to extremely low bit widths, such as 2-3 bits per parameter. AQLM extends the classic Additive Quantization (AQ) approach to LLM compression with two key innovations: 1) learned additive quantization of weight matrices in an input-adaptive manner, and 2) joint optimization of codebook parameters across each transformer block. AQLM is the first scheme that is Pareto optimal in terms of accuracy versus model size when compressing to less than 3 bits per parameter, and it significantly outperforms existing methods in the extreme-compression (2-bit) regime. It is also practical: efficient GPU and CPU implementations of token generation match or outperform optimized FP16 implementations in speed while using a much smaller memory footprint.
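To make the additive-quantization idea concrete, below is a minimal NumPy sketch of multi-codebook (additive) encoding and decoding of a single weight group. The group size, number of codebooks, codebook size, and the greedy residual encoder are illustrative assumptions for exposition, not AQLM's actual configuration; the paper's method learns the codebooks, encodes with a more powerful search, and fine-tunes codebook parameters jointly per transformer block.

```python
# Illustrative sketch of additive (multi-codebook) quantization of one weight group.
# All sizes and the greedy encoder are assumptions for exposition, not AQLM itself.
import numpy as np

GROUP_SIZE = 8        # weights quantized together as one group
NUM_CODEBOOKS = 2     # M additive codebooks
CODEBOOK_BITS = 8     # 2**8 = 256 codewords per codebook

rng = np.random.default_rng(0)
# Codebooks of shape (M, 2**B, GROUP_SIZE); in AQLM these are learned, here random.
codebooks = rng.normal(size=(NUM_CODEBOOKS, 2 ** CODEBOOK_BITS, GROUP_SIZE))

def decode(codes, codebooks):
    """Reconstruct a weight group as the sum of one selected codeword per codebook."""
    return sum(codebooks[m, codes[m]] for m in range(len(codes)))

def encode_greedy(w, codebooks):
    """Greedy residual encoding: pick the nearest codeword from each codebook in turn."""
    residual = w.copy()
    codes = []
    for m in range(codebooks.shape[0]):
        dists = np.linalg.norm(codebooks[m] - residual, axis=1)
        idx = int(np.argmin(dists))
        codes.append(idx)
        residual = residual - codebooks[m, idx]
    return codes

w = rng.normal(size=GROUP_SIZE)
codes = encode_greedy(w, codebooks)
w_hat = decode(codes, codebooks)

# Storage cost: M * B bits of codes per group -> (M * B) / GROUP_SIZE bits per weight.
bits_per_weight = NUM_CODEBOOKS * CODEBOOK_BITS / GROUP_SIZE  # 2.0 in this sketch
print(bits_per_weight, np.linalg.norm(w - w_hat))
```

With two 8-bit codebooks per group of eight weights, the stored codes amount to 16 bits per group, i.e. 2 bits per weight, which is how multi-codebook schemes reach the extreme-compression regime discussed above.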
The paper evaluates AQLM on compressing accurate open LLMs from the LLAMA 2 family at compression rates of 2-4 bits per parameter. Results show that AQLM outperforms the previous state of the art across the standard 2-4 bit compression range, with the largest gains for extreme 2-bit quantization. The algorithm is also evaluated on the Mixtral model, and further accuracy improvements are achieved with enhanced fine-tuning algorithms. AQLM is shown to be practical, with efficient implementations that match or outperform FP16 in speed while reducing memory usage by up to 8x. The paper also discusses the limitations of AQLM, notably its higher computational cost compared to direct post-training quantization methods, but highlights its efficiency on both CPU and GPU. Future work includes exploring better fine-tuning strategies and generalizing AQLM to other quantization scenarios.
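For intuition on where the "up to 8x" memory figure comes from: FP16 stores 16 bits per weight, so roughly 2-bit quantization shrinks weight storage by about 16/2 = 8x. The back-of-the-envelope calculation below uses illustrative parameter counts and ignores codebook, scale, and embedding overheads.

```python
# Rough memory arithmetic behind the "up to 8x" reduction claim.
# Parameter counts are illustrative; real deployments carry extra overheads.
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    return num_params * bits_per_param / 8 / 1e9

for params in (7e9, 70e9):  # e.g. 7B- and 70B-parameter models
    fp16 = weight_memory_gb(params, 16)
    aq2 = weight_memory_gb(params, 2)
    print(f"{params / 1e9:.0f}B params: FP16 ~{fp16:.1f} GB, "
          f"2-bit ~{aq2:.1f} GB ({fp16 / aq2:.0f}x smaller)")
```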