Rotation and Permutation for Advanced Outlier Management and Efficient Quantization of LLMs

3 Jun 2024 | Haokun Lin*1,3,4, Haobo Xu*2, Yichen Wu*4, Jingzhi Cui2, Yingtao Zhang2, Linzhan Mou5, Linqi Song4, Zhenan Sun†1,3, Ying Wei†6
The paper introduces DuQuant, a quantization strategy for large language models (LLMs) that addresses the challenge of outlier activations. Prior approaches handle Normal Outliers, activations with consistently high magnitudes across all tokens, but struggle with Massive Outliers, whose far larger values cause substantial performance losses under low-bit quantization. DuQuant employs rotation and permutation transformations to eliminate both types of outliers more effectively: a block-wise rotation redistributes outliers across adjacent channels within each rotation block, a zigzag permutation then balances outliers among blocks to minimize block-wise variance, and a second rotation further smooths the activation landscape. Extensive evaluations show that DuQuant significantly outperforms existing 4-bit weight-activation quantization baselines across benchmarks and LLM architectures, achieving top-tier results on multiple tasks. Notably, DuQuant delivers a 5% improvement on Commonsense QA tasks across all LLaMA model sizes and a 10% gain on the zero-shot MMLU benchmark for the Vicuna-v1.5-13B model. In practical deployments, DuQuant accelerates the prefilling phase by up to 2.08× and reduces memory usage by 3.20×, with minimal impact on performance. The code for DuQuant is available at https://github.com/Hsu1023/DuQuant.
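The rotation-permutation-rotation pipeline described above can be illustrated with a short sketch. The snippet below is a minimal, simplified illustration rather than the authors' implementation: helper names such as `random_block_rotation` and `zigzag_permutation` are assumptions made for this example, and it uses random orthogonal blocks for simplicity, whereas DuQuant constructs its rotations using knowledge of the actual outlier channels.

```python
# Minimal sketch of block-wise rotation + zigzag permutation for activation
# smoothing. NOT the DuQuant implementation; names and random rotations are
# illustrative assumptions.
import torch


def random_block_rotation(dim: int, block: int, seed: int = 0) -> torch.Tensor:
    """Block-diagonal orthogonal matrix: an independent random rotation per block of channels."""
    g = torch.Generator().manual_seed(seed)
    R = torch.zeros(dim, dim)
    for start in range(0, dim, block):
        end = min(start + block, dim)
        q, _ = torch.linalg.qr(torch.randn(end - start, end - start, generator=g))
        R[start:end, start:end] = q
    return R


def zigzag_permutation(channel_scale: torch.Tensor, block: int) -> torch.Tensor:
    """Reorder channels so high-magnitude channels are spread evenly across blocks.

    Channels are sorted by descending magnitude and dealt to blocks in a
    zigzag (serpentine) order: block 0, 1, ..., B-1, then B-1, ..., 1, 0, ...
    """
    dim = channel_scale.numel()
    n_blocks = (dim + block - 1) // block
    order = torch.argsort(channel_scale, descending=True).tolist()
    buckets = [[] for _ in range(n_blocks)]
    for rank, ch in enumerate(order):
        sweep, pos = divmod(rank, n_blocks)
        b = pos if sweep % 2 == 0 else n_blocks - 1 - pos
        buckets[b].append(ch)
    return torch.tensor([ch for bucket in buckets for ch in bucket])


# Toy example: one "massive outlier" channel in a (tokens, channels) activation.
torch.manual_seed(0)
x = torch.randn(16, 8)
x[:, 3] += 50.0
block = 4

R1 = random_block_rotation(x.shape[1], block, seed=0)
x1 = x @ R1                                    # spread the outlier inside its block
perm = zigzag_permutation(x1.abs().amax(dim=0), block)
x2 = x1[:, perm]                               # balance outlier channels across blocks
R2 = random_block_rotation(x.shape[1], block, seed=1)
x3 = x2 @ R2                                   # second rotation flattens the landscape further

print(f"max |activation| before: {x.abs().max():.1f}, after: {x3.abs().max():.1f}")
```

Because each rotation is block-diagonal and the permutation is a fixed index reorder, both transformations are cheap to apply relative to the matrix multiplications they precede.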
[slides and audio] DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs