21 Jun 2024 | Baohao Liao, Christian Herold, Shahram Khadivi, Christof Monz
ApiQ is a novel quantization framework designed to reduce activation error when quantizing large language models (LLMs) while preserving the effectiveness of subsequent parameter-efficient fine-tuning. The framework addresses the challenges of quantization by jointly initializing the LoRA components and quantizing the LLM's weights, so that the output of the quantized LLM stays consistent with that of the full-precision LLM. This minimizes error propagation from shallower to deeper layers, leading to more accurate and stable fine-tuning results across various bit-widths.
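To make the joint initialization concrete, the sketch below illustrates the core idea on a single linear layer: freeze a quantized copy of the weight and fit the LoRA factors so that the layer's output on calibration inputs matches the full-precision layer's output, i.e. it minimizes activation error rather than weight error. This is a minimal illustration, not the authors' implementation: the function names (`uniform_quantize`, `apiq_style_init`), the per-tensor uniform quantizer, and the plain Adam loop are all assumptions made for the sake of the example.

```python
import torch

def uniform_quantize(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Per-tensor asymmetric uniform quantization (illustrative stand-in)."""
    qmax = 2 ** n_bits - 1
    scale = (w.max() - w.min()).clamp(min=1e-8) / qmax
    zero = torch.round(-w.min() / scale)
    w_int = torch.clamp(torch.round(w / scale) + zero, 0, qmax)
    return (w_int - zero) * scale  # dequantized low-bit weights

def apiq_style_init(W: torch.Tensor, X: torch.Tensor, rank: int = 16,
                    n_bits: int = 4, steps: int = 200, lr: float = 1e-3):
    """Fit LoRA factors A, B so that X @ (Wq + B @ A).T ~= X @ W.T.

    W: full-precision weight, shape (out_features, in_features)
    X: calibration activations, shape (n_samples, in_features)
    """
    out_f, in_f = W.shape
    Wq = uniform_quantize(W, n_bits)               # frozen quantized backbone
    A = (0.01 * torch.randn(rank, in_f)).requires_grad_(True)
    B = torch.zeros(out_f, rank, requires_grad=True)
    target = X @ W.T                               # full-precision layer output
    opt = torch.optim.Adam([A, B], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        pred = X @ (Wq + B @ A).T                  # quantized backbone + LoRA output
        loss = torch.mean((pred - target) ** 2)    # activation error, not weight error
        loss.backward()
        opt.step()
    return Wq, A.detach(), B.detach()
```

For contrast, a weight-level initialization such as LoftQ's minimizes the weight error ||W − (Wq + BA)||_F without reference to the inputs; fitting the activations on calibration data instead is what lets an ApiQ-style initialization keep quantization errors in shallower layers from compounding in deeper ones.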
ApiQ is evaluated on a range of language tasks using different LLMs, demonstrating superior performance compared to existing methods such as QLoRA and LoftQ. It effectively reduces activation error during quantization, preserving the knowledge from the full-precision LLM and enabling better fine-tuning results. The framework is also efficient in terms of GPU memory usage and can be applied to various parameter-efficient fine-tuning methods, including DoRA.
The key contributions of ApiQ include an in-depth analysis of the challenges associated with fine-tuning quantized LLMs, the proposal of a novel quantization framework that preserves the activation of the full-precision LLM, and the demonstration of superior performance post-quantization, even surpassing the latest post-training quantization techniques. Extensive experiments on five LLMs across five different tasks show that ApiQ consistently outperforms all baselines at various bit levels.