AFFINEQUANT: AFFINE TRANSFORMATION QUANTIZATION FOR LARGE LANGUAGE MODELS

2024 | Yuexiao Ma, Huixia Li, Xiawu Zheng, Feng Ling, Xuefeng Xiao, Rui Wang, Shilei Wen, Fei Chao, Rongrong Ji
AffineQuant is a post-training quantization (PTQ) method that uses equivalent affine transformations to minimize quantization error in large language models (LLMs). Whereas existing PTQ methods restrict the optimization to scaling transformations between pre- and post-quantization weights, AffineQuant directly optimizes an equivalent affine transformation, which enlarges the optimization space and substantially reduces quantization error. Because the transformation is equivalent, the pre- and post-quantization outputs remain mathematically identical, preserving efficiency and generalization. To keep the transformation matrix invertible throughout optimization, a gradual mask method is introduced; consistent with the Lévy-Desplanques theorem (a strictly diagonally dominant matrix is non-singular), it theoretically guarantees invertibility.

AffineQuant delivers significant performance improvements across LLMs and datasets, particularly under low-bit quantization, enabling deployment on edge devices. For example, on LLaMA2-7B with W4A4 quantization it reaches a C4 perplexity of 15.76, outperforming OmniQuant by 2.26, and on LLaMA-30B with 4/4-bit quantization it attains 58.61% average accuracy on six zero-shot tasks, surpassing OmniQuant by 1.98%. The method also offers strong inference efficiency, remains compatible with other approaches after matrix merging, and achieves state-of-the-art results in LLM quantization, especially for small-scale models or low-bit configurations.

The contributions are threefold: a novel affine transform for PTQ, a novel optimization algorithm that ensures invertibility of the transformation, and state-of-the-art LLM quantization performance. The method is supported by extensive experiments and comparisons with existing PTQ methods, demonstrating its effectiveness in reducing quantization error and improving model performance.
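To make the equivalence idea concrete, the sketch below shows how an invertible matrix A can be folded into a linear layer without changing its output: Y = XW = (XA^-1)(AW), so the quantizer sees the transformed weights AW and the transformed activations XA^-1. This is a minimal NumPy illustration, not the paper's implementation; the matrix A here is an arbitrary strictly diagonally dominant placeholder, whereas AffineQuant learns A to minimize the post-quantization output error.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_quant(t, n_bits=4):
    """Uniform symmetric per-tensor fake quantization (illustrative only)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(t).max() / qmax
    return np.clip(np.round(t / scale), -qmax, qmax) * scale

# Toy linear layer Y = X @ W with an outlier activation channel,
# the typical difficulty in low-bit LLM quantization.
tokens, d_in, d_out = 4, 8, 16
X = rng.normal(size=(tokens, d_in))
X[:, 0] *= 50.0                        # outlier channel
W = rng.normal(size=(d_in, d_out))

# Placeholder affine matrix: strictly diagonally dominant, hence invertible.
# AffineQuant would optimize these entries rather than draw them at random.
A = rng.normal(scale=0.05, size=(d_in, d_in))
np.fill_diagonal(A, np.abs(A).sum(axis=1) + 1.0)

Y_ref = X @ W
# Exact equivalence before quantization: (X A^-1)(A W) == X W.
assert np.allclose((X @ np.linalg.inv(A)) @ (A @ W), Y_ref)

# The quantity AffineQuant minimizes is the layer-output error after quantization.
err_plain  = np.linalg.norm(fake_quant(X) @ fake_quant(W) - Y_ref)
err_affine = np.linalg.norm(fake_quant(X @ np.linalg.inv(A)) @ fake_quant(A @ W) - Y_ref)
print(f"output error, direct quantization : {err_plain:.3f}")
print(f"output error, affine-transformed  : {err_affine:.3f}")
```

The gradual mask can be sketched in the same spirit: optimization starts from the identity matrix and progressively frees off-diagonal entries near the diagonal, while their magnitudes are kept small enough that every row stays strictly diagonally dominant, so the Lévy-Desplanques theorem guarantees the matrix remains invertible. The band-shaped schedule and the clamping step below are illustrative simplifications, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def band_mask(dim, width):
    """1 for entries within `width` of the diagonal, 0 elsewhere (illustrative schedule)."""
    idx = np.arange(dim)
    return (np.abs(idx[:, None] - idx[None, :]) <= width).astype(float)

def strictly_diagonally_dominant(A):
    """Levy-Desplanques condition: |a_ii| > sum_{j!=i} |a_ij| implies A is invertible."""
    off = np.abs(A).sum(axis=1) - np.abs(np.diag(A))
    return bool(np.all(np.abs(np.diag(A)) > off))

dim = 8
A = np.eye(dim)                                     # start from the identity (invertible)
off_diag = 1.0 - np.eye(dim)
limit = 0.9 / (dim - 1)                             # per-entry cap keeps each row dominant

for width in range(1, dim):                         # widen the trainable band step by step
    mask = band_mask(dim, width) * off_diag
    step = rng.normal(scale=0.02, size=(dim, dim))  # stand-in for a gradient update
    A += mask * step
    # Clamp off-diagonal magnitudes so strict diagonal dominance is preserved.
    A = np.eye(dim) + np.clip(A * off_diag, -limit, limit)
    assert strictly_diagonally_dominant(A)

print("invertible:", np.abs(np.linalg.det(A)) > 0)
```

After optimization, such a transformation is meant to be merged into the neighbouring weights (the matrix merging mentioned above), so it is intended to add no extra cost at inference time.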