APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models


June 23-27, 2024 | Ziyi Guan, Hantao Huang, Yupeng Su, Hong Huang, Ngai Wong, Hao Yu
This paper proposes APTQ, an attention-aware post-training mixed-precision quantization method for large language models (LLMs). APTQ considers both the second-order information of each layer's weights and the nonlinear effect of the attention output on the whole model. It leverages the Hessian trace as a sensitivity metric for mixed-precision quantization, ensuring an informed precision reduction that retains model performance. Experiments show that APTQ surpasses previous quantization methods, achieving an average perplexity of 5.22 on the C4 dataset, nearly equivalent to full precision. In addition, APTQ attains state-of-the-art zero-shot accuracies of 68.24% and 70.48% at an average bitwidth of 3.8 on LLaMa-7B and LLaMa-13B, respectively.

APTQ formulates the quantization optimization problem over the whole attention block, including the nonlinear softmax operation, rather than over each linear layer in isolation. It utilizes gradients derived from the attention output and develops a second-order Hessian optimization strategy to quantize the weights (a generic sketch of this second-order view is given below). By doing so, APTQ significantly reduces the quantization error in these crucial components, thereby preserving the model's integrity throughout compression.
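To make the idea concrete, here is a generic second-order sketch, not the paper's exact derivation: the change in the attention-block loss under a quantization perturbation is approximated with a Taylor expansion, and each layer's sensitivity is summarized by its average Hessian trace combined with its quantization error. The symbols L, w, g, H, Q(·), n_ℓ, and Ω_ℓ are placeholder notation introduced here for illustration.

```latex
% Generic second-order sensitivity sketch (placeholder notation, not the
% paper's exact formulation). L: attention-block loss, w: flattened layer
% weights, \Delta w: quantization perturbation, g: gradient, H: Hessian.
\Delta L \;\approx\; g^{\top}\Delta w \;+\; \tfrac{1}{2}\,\Delta w^{\top} H\,\Delta w

% Hessian-trace-based sensitivity of layer \ell (with n_\ell weights),
% combining average curvature with the layer's quantization error:
\Omega_{\ell} \;=\; \frac{\mathrm{Tr}(H_{\ell})}{n_{\ell}}\,
                    \bigl\lVert Q(w_{\ell}) - w_{\ell} \bigr\rVert_{2}^{2}
```

Under this reading, layers with larger Ω_ℓ are the ones most worth keeping at higher precision, which is what the mixed-precision scheme below exploits.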
APTQ further proposes a novel Hessian trace-based quantization sensitivity metric to drive mixed-precision quantization and compress LLMs even further. This approach judiciously applies different bitwidths across the model's layers so that the model fits the limited memory of edge devices while balancing size and accuracy; a minimal allocation sketch follows the contribution list below. The result is a mixed-precision 2/4-bit hybrid scheme whose performance is comparable to a uniform 4-bit representation. In particular, APTQ produces a compressed model close to its full-precision counterpart and outperforms GPTQ, especially in ultra-low-bit quantization scenarios.

The main contributions of this paper are threefold:
(1) This is the first work to quantize LLMs by integrating attention-based gradients with second-order Hessian optimization, yielding a nuanced update mechanism that enhances precision throughout the quantization process.
(2) An innovative Hessian trace-driven mixed-precision quantization scheme is proposed that judiciously allocates high/low bitwidths across different layers based on their sensitivity, optimizing model performance while maintaining efficiency.
(3) Through extensive experimentation on the LLaMa models, APTQ not only achieves state-of-the-art (SOTA) results on the C4 dataset but also attains near full-precision perplexity at an average quantization of 4 bits. In zero-shot tasks, APTQ also demonstrates superior performance compared to SOTA approaches.
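As a minimal sketch of how such a sensitivity-driven 2/4-bit allocation could work (this is an assumption-laden illustration, not the authors' code): layers are sorted by a precomputed Hessian-trace sensitivity score, and the most sensitive layers are upgraded to 4 bits until an average-bitwidth budget (e.g., 3.8 bits) is exhausted. The function name, the score dictionary, and the budget handling are hypothetical.

```python
def allocate_bitwidths(sensitivity, num_params, avg_bit_budget=3.8,
                       high_bit=4, low_bit=2):
    """Assign 2 or 4 bits per layer under an average-bitwidth budget.

    Illustrative sketch only (not APTQ's actual implementation).
    sensitivity: dict layer_name -> Hessian-trace-based score
    num_params:  dict layer_name -> number of weights in that layer
    """
    total_params = sum(num_params.values())
    bit_budget = avg_bit_budget * total_params      # total bits available
    bits = {name: low_bit for name in sensitivity}  # start all layers at 2 bits
    used = low_bit * total_params

    # Upgrade the most sensitive layers to 4 bits while the budget allows.
    for name in sorted(sensitivity, key=sensitivity.get, reverse=True):
        extra = (high_bit - low_bit) * num_params[name]
        if used + extra <= bit_budget:
            bits[name] = high_bit
            used += extra
    return bits


# Example with made-up sensitivities and layer sizes.
scores = {"q_proj": 8.1, "k_proj": 7.6, "v_proj": 3.2, "o_proj": 1.4}
sizes = {"q_proj": 4096 * 4096, "k_proj": 4096 * 4096,
         "v_proj": 4096 * 4096, "o_proj": 4096 * 4096}
print(allocate_bitwidths(scores, sizes))  # most sensitive layers receive 4 bits
```

The greedy upgrade order is one simple way to respect a memory budget; the paper's actual allocation policy may differ, but the principle of spending bits where the Hessian-trace sensitivity is highest is the same.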