AFFINEQUANT: AFFINE TRANSFORMATION QUANTIZATION FOR LARGE LANGUAGE MODELS

2024 | Yuexiao Ma, Huixia Li, Xiawu Zheng, Feng Ling, Xuefeng Xiao, Rui Wang, Shilei Wen, Fei Chao, Rongrong Ji
AffineQuant is a post-training quantization (PTQ) method that uses equivalent affine transformations to minimize quantization error in large language models (LLMs). Whereas existing PTQ methods restrict the optimization to scaling transformations between pre- and post-quantization weights, AffineQuant directly optimizes an equivalent affine transformation, which enlarges the optimization space and substantially reduces quantization error. Because the transformation is equivalent, the pre- and post-quantization outputs remain mathematically identical, preserving efficiency and generalization. To keep the transformation matrix invertible throughout optimization, a gradual mask method is introduced; consistent with the Lévy-Desplanques theorem (a strictly diagonally dominant matrix is non-singular), it theoretically guarantees invertibility.

AffineQuant delivers significant performance improvements across LLMs and datasets, particularly under low-bit quantization, enabling deployment on edge devices. For example, on LLaMA2-7B with W4A4 quantization it reaches a C4 perplexity of 15.76, outperforming OmniQuant by 2.26, and on LLaMA-30B with 4/4-bit quantization it attains 58.61% average accuracy on six zero-shot tasks, surpassing OmniQuant by 1.98%. The method also offers strong inference efficiency, remains compatible with other approaches after matrix merging, and achieves state-of-the-art results in LLM quantization, especially for small-scale models or low-bit configurations.

The contributions are threefold: a novel affine transform for PTQ, a novel optimization algorithm that ensures invertibility of the transformation, and state-of-the-art LLM quantization performance. The method is supported by extensive experiments and comparisons with existing PTQ methods, demonstrating its effectiveness in reducing quantization error and improving model performance.
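To make the equivalence idea concrete, the sketch below shows how an invertible matrix A can be folded into a linear layer without changing its output: Y = XW = (XA^-1)(AW), so the quantizer sees the transformed weights AW and the transformed activations XA^-1. This is a minimal NumPy illustration, not the paper's implementation; the matrix A here is an arbitrary strictly diagonally dominant placeholder, whereas AffineQuant learns A to minimize the post-quantization output error.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_quant(t, n_bits=4):
    """Uniform symmetric per-tensor fake quantization (illustrative only)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(t).max() / qmax
    return np.clip(np.round(t / scale), -qmax, qmax) * scale

# Toy linear layer Y = X @ W with an outlier activation channel,
# the typical difficulty in low-bit LLM quantization.
tokens, d_in, d_out = 4, 8, 16
X = rng.normal(size=(tokens, d_in))
X[:, 0] *= 50.0                        # outlier channel
W = rng.normal(size=(d_in, d_out))

# Placeholder affine matrix: strictly diagonally dominant, hence invertible.
# AffineQuant would optimize these entries rather than draw them at random.
A = rng.normal(scale=0.05, size=(d_in, d_in))
np.fill_diagonal(A, np.abs(A).sum(axis=1) + 1.0)

Y_ref = X @ W
# Exact equivalence before quantization: (X A^-1)(A W) == X W.
assert np.allclose((X @ np.linalg.inv(A)) @ (A @ W), Y_ref)

# The quantity AffineQuant minimizes is the layer-output error after quantization.
err_plain  = np.linalg.norm(fake_quant(X) @ fake_quant(W) - Y_ref)
err_affine = np.linalg.norm(fake_quant(X @ np.linalg.inv(A)) @ fake_quant(A @ W) - Y_ref)
print(f"output error, direct quantization : {err_plain:.3f}")
print(f"output error, affine-transformed  : {err_affine:.3f}")
```

The gradual mask can be sketched in the same spirit: optimization starts from the identity matrix and progressively frees off-diagonal entries near the diagonal, while their magnitudes are kept small enough that every row stays strictly diagonally dominant, so the Lévy-Desplanques theorem guarantees the matrix remains invertible. The band-shaped schedule and the clamping step below are illustrative simplifications, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def band_mask(dim, width):
    """1 for entries within `width` of the diagonal, 0 elsewhere (illustrative schedule)."""
    idx = np.arange(dim)
    return (np.abs(idx[:, None] - idx[None, :]) <= width).astype(float)

def strictly_diagonally_dominant(A):
    """Levy-Desplanques condition: |a_ii| > sum_{j!=i} |a_ij| implies A is invertible."""
    off = np.abs(A).sum(axis=1) - np.abs(np.diag(A))
    return bool(np.all(np.abs(np.diag(A)) > off))

dim = 8
A = np.eye(dim)                                     # start from the identity (invertible)
off_diag = 1.0 - np.eye(dim)
limit = 0.9 / (dim - 1)                             # per-entry cap keeps each row dominant

for width in range(1, dim):                         # widen the trainable band step by step
    mask = band_mask(dim, width) * off_diag
    step = rng.normal(scale=0.02, size=(dim, dim))  # stand-in for a gradient update
    A += mask * step
    # Clamp off-diagonal magnitudes so strict diagonal dominance is preserved.
    A = np.eye(dim) + np.clip(A * off_diag, -limit, limit)
    assert strictly_diagonally_dominant(A)

print("invertible:", np.abs(np.linalg.det(A)) > 0)
```

After optimization, such a transformation is meant to be merged into the neighbouring weights (the matrix merging mentioned above), so it is intended to add no extra cost at inference time.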