8 Jun 2024 | Vage Egiazarian*, Andrei Panferov*, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh
The paper introduces AQLM (Additive Quantization of Language Models), a novel approach to extreme compression of large language models (LLMs) using Multi-Codebook Quantization (MCQ). AQLM generalizes Additive Quantization (AQ) to compress LLM weights, achieving Pareto optimality in terms of accuracy and model size when compressing to less than 3 bits per parameter. The key innovations include learned additive quantization of weight matrices and joint optimization of codebook parameters across transformer blocks. AQLM outperforms existing methods in the extreme 2-bit compression regime and is practical for inference, with fast GPU and CPU implementations that match or outperform optimized FP16 implementations in speed while reducing memory footprint by up to 8x. The paper evaluates AQLM on Llama 2 models and demonstrates its effectiveness through detailed ablations and comparisons with state-of-the-art methods.
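To make the multi-codebook idea concrete, here is a minimal NumPy sketch of additive dequantization: each group of weights is reconstructed as the *sum* of one vector from each of M codebooks, selected by per-group integer codes. The parameter choices below (group size 8, two 8-bit codebooks, giving roughly 2 bits per weight before codebook overhead) are illustrative assumptions for this sketch, not a reproduction of the paper's exact configuration or kernels.

```python
import numpy as np

def dequantize_additive(codes, codebooks, out_shape):
    """Reconstruct a weight matrix from additive-quantization codes.

    codes:     (num_groups, M) int array -- one index per codebook per group
    codebooks: (M, 2**B, g) float array  -- M codebooks of 2**B vectors of length g
    Each weight group of size g is the sum of its M selected codebook vectors.
    """
    M = codebooks.shape[0]
    # For each codebook m, gather the selected vector for every group, then sum.
    groups = sum(codebooks[m, codes[:, m]] for m in range(M))  # (num_groups, g)
    return groups.reshape(out_shape)

# Illustrative ~2-bit setup: (2 codebooks * 8-bit codes) / 8 weights per group
# = 2 bits per weight, ignoring the (amortized) cost of storing the codebooks.
rng = np.random.default_rng(0)
M, B, g = 2, 8, 8
out_features, in_features = 16, 32
num_groups = out_features * in_features // g

codebooks = rng.standard_normal((M, 2**B, g)).astype(np.float32)
codes = rng.integers(0, 2**B, size=(num_groups, M))
W = dequantize_additive(codes, codebooks, (out_features, in_features))
print(W.shape)  # (16, 32)
```

In the paper's actual method, the codes and codebooks are not random but *learned* to minimize the layer's output reconstruction error, with codebooks further fine-tuned jointly across each transformer block; the sketch only shows the decoding side that the fast inference kernels implement.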