APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models


June 23-27, 2024 | Ziyi Guan, Hantao Huang, Yupeng Su, Hong Huang, Ngai Wong, Hao Yu
This paper proposes APTQ, an attention-aware post-training mixed-precision quantization method for large language models (LLMs). APTQ considers both the second-order information of each layer's weights and the nonlinear effect of the attention output on the whole model. It leverages the Hessian trace as a sensitivity metric for mixed-precision quantization, ensuring an informed precision reduction that retains model performance. Experiments show that APTQ surpasses previous quantization methods, achieving an average perplexity of 5.22 on the C4 dataset, nearly equivalent to full precision. In addition, APTQ attains state-of-the-art zero-shot accuracies of 68.24% and 70.48% at an average bitwidth of 3.8 on LLaMa-7B and LLaMa-13B, respectively.

APTQ formulates the quantization optimization problem over the whole attention block, including the nonlinear softmax operation, rather than over each linear layer in isolation. It utilizes gradients derived from the attention output and develops a second-order Hessian optimization strategy to quantize the weights (a generic sketch of this second-order view is given below). By doing so, APTQ significantly reduces the quantization error in these crucial components, thereby preserving the model's integrity throughout compression.
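To make the idea concrete, here is a generic second-order sketch, not the paper's exact derivation: the change in the attention-block loss under a quantization perturbation is approximated with a Taylor expansion, and each layer's sensitivity is summarized by its average Hessian trace combined with its quantization error. The symbols L, w, g, H, Q(·), n_ℓ, and Ω_ℓ are placeholder notation introduced here for illustration.

```latex
% Generic second-order sensitivity sketch (placeholder notation, not the
% paper's exact formulation). L: attention-block loss, w: flattened layer
% weights, \Delta w: quantization perturbation, g: gradient, H: Hessian.
\Delta L \;\approx\; g^{\top}\Delta w \;+\; \tfrac{1}{2}\,\Delta w^{\top} H\,\Delta w

% Hessian-trace-based sensitivity of layer \ell (with n_\ell weights),
% combining average curvature with the layer's quantization error:
\Omega_{\ell} \;=\; \frac{\mathrm{Tr}(H_{\ell})}{n_{\ell}}\,
                    \bigl\lVert Q(w_{\ell}) - w_{\ell} \bigr\rVert_{2}^{2}
```

Under this reading, layers with larger Ω_ℓ are the ones most worth keeping at higher precision, which is what the mixed-precision scheme below exploits.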
APTQ further proposes a novel Hessian trace-based quantization sensitivity metric to drive mixed-precision quantization and compress LLMs even further. This approach judiciously applies different bitwidths across the model's layers so that the model fits the limited memory of edge devices while balancing size and accuracy; a minimal allocation sketch follows the contribution list below. The result is a mixed-precision 2/4-bit hybrid scheme whose performance is comparable to a uniform 4-bit representation. In particular, APTQ produces a compressed model close to its full-precision counterpart and outperforms GPTQ, especially in ultra-low-bit quantization scenarios.

The main contributions of this paper are threefold:
(1) This is the first work to quantize LLMs by integrating attention-based gradients with second-order Hessian optimization, yielding a nuanced update mechanism that enhances precision throughout the quantization process.
(2) An innovative Hessian trace-driven mixed-precision quantization scheme is proposed that judiciously allocates high/low bitwidths across different layers based on their sensitivity, optimizing model performance while maintaining efficiency.
(3) Through extensive experimentation on the LLaMa models, APTQ not only achieves state-of-the-art (SOTA) results on the C4 dataset but also attains near full-precision perplexity at an average quantization of 4 bits. In zero-shot tasks, APTQ also demonstrates superior performance compared to SOTA approaches.
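As a minimal sketch of how such a sensitivity-driven 2/4-bit allocation could work (this is an assumption-laden illustration, not the authors' code): layers are sorted by a precomputed Hessian-trace sensitivity score, and the most sensitive layers are upgraded to 4 bits until an average-bitwidth budget (e.g., 3.8 bits) is exhausted. The function name, the score dictionary, and the budget handling are hypothetical.

```python
def allocate_bitwidths(sensitivity, num_params, avg_bit_budget=3.8,
                       high_bit=4, low_bit=2):
    """Assign 2 or 4 bits per layer under an average-bitwidth budget.

    Illustrative sketch only (not APTQ's actual implementation).
    sensitivity: dict layer_name -> Hessian-trace-based score
    num_params:  dict layer_name -> number of weights in that layer
    """
    total_params = sum(num_params.values())
    bit_budget = avg_bit_budget * total_params      # total bits available
    bits = {name: low_bit for name in sensitivity}  # start all layers at 2 bits
    used = low_bit * total_params

    # Upgrade the most sensitive layers to 4 bits while the budget allows.
    for name in sorted(sensitivity, key=sensitivity.get, reverse=True):
        extra = (high_bit - low_bit) * num_params[name]
        if used + extra <= bit_budget:
            bits[name] = high_bit
            used += extra
    return bits


# Example with made-up sensitivities and layer sizes.
scores = {"q_proj": 8.1, "k_proj": 7.6, "v_proj": 3.2, "o_proj": 1.4}
sizes = {"q_proj": 4096 * 4096, "k_proj": 4096 * 4096,
         "v_proj": 4096 * 4096, "o_proj": 4096 * 4096}
print(allocate_bitwidths(scores, sizes))  # most sensitive layers receive 4 bits
```

The greedy upgrade order is one simple way to respect a memory budget; the paper's actual allocation policy may differ, but the principle of spending bits where the Hessian-trace sensitivity is highest is the same.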