QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

10 May 2024 | Yujun Lin*, Haotian Tang*, Shang Yang*, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han
QServe is a system for efficient large language model (LLM) serving built around W4A8KV4 quantization: 4-bit weights, 8-bit activations, and 4-bit key-value (KV) caches. It targets the high dequantization overhead that limits existing INT4 quantization methods on GPUs.

At the algorithm level, QServe introduces QoQ, which combines progressive group quantization, SmoothAttention, and further optimizations to reduce dequantization latency and improve throughput. Progressive group quantization first quantizes weights to 8 bits using per-channel FP16 scales, then quantizes these 8-bit intermediates to 4 bits at group granularity; because the 4-bit weights can be cheaply dequantized back to 8 bits, all GEMMs are performed on INT8 tensor cores. SmoothAttention mitigates the accuracy degradation of 4-bit KV quantization by shifting the difficulty of activation quantization from the keys to the queries, which are never quantized. Both ideas are sketched below.

At the system level, QServe implements compute-aware weight reordering and register-level parallelism to reduce dequantization latency, and co-designs its attention kernels to harness the performance gain of KV4 quantization.
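To make the two-level scheme concrete, here is a minimal NumPy sketch of progressive group quantization. The function names, the group size of 128, and the choice of symmetric INT8 at level one with asymmetric UINT4 at level two are assumptions for illustration; the actual QoQ kernels additionally pack the 4-bit values and apply a protective range so dequantized values never overflow INT8.

```python
import numpy as np

def progressive_group_quantize(w_fp16: np.ndarray, group_size: int = 128):
    """Two-level sketch: FP16 -> INT8 (per-channel) -> UINT4 (per-group).

    w_fp16: weight matrix of shape [out_channels, in_channels].
    Illustrative only, not the exact QServe implementation.
    """
    # Level 1: symmetric per-output-channel INT8 quantization with FP16 scales.
    s_channel = np.maximum(np.abs(w_fp16).max(axis=1, keepdims=True) / 127.0, 1e-8)
    w_int8 = np.clip(np.round(w_fp16 / s_channel), -128, 127)

    # Level 2: asymmetric per-group UINT4 quantization of the INT8 intermediates.
    out_ch, in_ch = w_int8.shape
    assert in_ch % group_size == 0
    groups = w_int8.reshape(out_ch, in_ch // group_size, group_size)
    g_min = groups.min(axis=2, keepdims=True)
    g_max = groups.max(axis=2, keepdims=True)
    s_group = np.maximum((g_max - g_min) / 15.0, 1e-8)
    zeros = np.round(-g_min / s_group)
    w_uint4 = np.clip(np.round(groups / s_group) + zeros, 0, 15).astype(np.uint8)
    return w_uint4, s_group, zeros, s_channel

def dequant_to_int8(w_uint4, s_group, zeros):
    # What the GEMM kernel does in registers: UINT4 -> INT8 via a subtract and
    # a multiply, so the matrix multiply itself runs on INT8 tensor cores.
    v = (w_uint4.astype(np.float32) - zeros) * s_group
    return np.clip(np.round(v), -128, 127).astype(np.int8)
```

The key property is that level-two dequantization is an integer-range subtract-and-multiply, cheap enough to run inside the GEMM main loop; only the final per-channel rescale back to FP16 happens once per output.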
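SmoothAttention can likewise be sketched in a few lines. The per-channel factor λ = max|K|^α with α = 0.5 is an assumed migration strength following the SmoothQuant-style recipe; treat this as an illustration rather than the exact QServe implementation.

```python
import numpy as np

def smooth_attention(q: np.ndarray, k: np.ndarray, alpha: float = 0.5):
    """Migrate key-quantization difficulty to the (unquantized) queries.

    q, k: [num_tokens, head_dim]. Scaling keys down by a per-channel factor
    and queries up by the same factor leaves q @ k.T unchanged while
    flattening key outliers, so K becomes easy to quantize to 4 bits.
    """
    lam = np.maximum(np.abs(k).max(axis=0), 1e-5) ** alpha  # per-channel factor
    q_s, k_s = q * lam, k / lam
    assert np.allclose(q @ k.T, q_s @ k_s.T, atol=1e-3)     # scores preserved
    return q_s, k_s

rng = np.random.default_rng(0)
q, k = rng.standard_normal((16, 128)), rng.standard_normal((16, 128))
smooth_attention(q, k)
```

Because the scaling is fixed per channel, it can be folded offline into the query and key projection weights, so smoothing adds no runtime cost.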
These techniques yield significant serving-throughput improvements over state-of-the-art systems such as TensorRT-LLM. For example, QServe improves the maximum achievable serving throughput of Llama-3-8B by 1.2× on A100 and 1.4× on L40S, and of Qwen1.5-72B by 2.4× on A100 and 3.5× on L40S. Notably, QServe on an L40S can exceed the throughput of TensorRT-LLM on an A100, reducing the dollar cost of LLM serving by 3×.

QServe also optimizes attention computation. Decoding attention with a 4-bit KV cache (KV4 attention) is memory-bound, and the kernel design keeps attention within the memory-bound region, where low-bit quantization translates directly into higher throughput; a back-of-the-envelope check follows below. In addition, QServe's W4A8 GEMM kernel achieves a 1.5× speedup over the W8A8 cuBLAS GEMM. The system is validated on seven widely used LLMs on A100 and L40S GPUs, demonstrating significant improvements in throughput and accuracy compared to existing systems. Overall, QServe's co-design of quantization algorithm and serving system effectively reduces dequantization overhead and improves the efficiency of LLM serving.
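The memory-bound claim is easy to sanity-check. The configuration below (32 layers, 8 grouped-query KV heads, head dimension 128, roughly matching Llama-3-8B) and the batch and sequence sizes are assumptions chosen for illustration:

```python
def kv_bytes_per_decode_step(layers=32, kv_heads=8, head_dim=128,
                             seq_len=4096, batch=64, bits=16):
    """Bytes of KV cache read to decode one token (keys + values)."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bits // 8

for bits in (16, 8, 4):
    gb = kv_bytes_per_decode_step(bits=bits) / 1e9
    print(f"KV{bits}: {gb:5.1f} GB read per decoding step")
# KV16: 34.4 GB, KV8: 17.2 GB, KV4: 8.6 GB. Since decode attention performs
# very little compute per byte loaded, halving the KV bit width roughly
# halves attention latency -- as long as the kernel stays memory-bound.
```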