QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

10 May 2024 | Yujun Lin*, Haotian Tang*, Shang Yang*, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han
QServe is a system for efficient large language model (LLM) serving built around W4A8KV4 quantization: 4-bit weights, 8-bit activations, and 4-bit key-value (KV) caches. It targets the high dequantization overhead that limits existing INT4 quantization methods on GPUs.

At the algorithm level, QServe introduces QoQ, which combines progressive group quantization, SmoothAttention, and further optimizations to reduce dequantization latency and improve throughput. Progressive group quantization first quantizes weights to 8 bits using per-channel FP16 scales, then quantizes these 8-bit intermediates to 4 bits at group granularity; because the 4-bit weights can be cheaply dequantized back to 8 bits, all GEMMs are performed on INT8 tensor cores. SmoothAttention mitigates the accuracy degradation of 4-bit KV quantization by shifting the difficulty of activation quantization from the keys to the queries, which are never quantized. Both ideas are sketched below.

At the system level, QServe implements compute-aware weight reordering and register-level parallelism to reduce dequantization latency, and co-designs its attention kernels to harness the performance gain of KV4 quantization.
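To make the two-level scheme concrete, here is a minimal NumPy sketch of progressive group quantization. The function names, the group size of 128, and the choice of symmetric INT8 at level one with asymmetric UINT4 at level two are assumptions for illustration; the actual QoQ kernels additionally pack the 4-bit values and apply a protective range so dequantized values never overflow INT8.

```python
import numpy as np

def progressive_group_quantize(w_fp16: np.ndarray, group_size: int = 128):
    """Two-level sketch: FP16 -> INT8 (per-channel) -> UINT4 (per-group).

    w_fp16: weight matrix of shape [out_channels, in_channels].
    Illustrative only, not the exact QServe implementation.
    """
    # Level 1: symmetric per-output-channel INT8 quantization with FP16 scales.
    s_channel = np.maximum(np.abs(w_fp16).max(axis=1, keepdims=True) / 127.0, 1e-8)
    w_int8 = np.clip(np.round(w_fp16 / s_channel), -128, 127)

    # Level 2: asymmetric per-group UINT4 quantization of the INT8 intermediates.
    out_ch, in_ch = w_int8.shape
    assert in_ch % group_size == 0
    groups = w_int8.reshape(out_ch, in_ch // group_size, group_size)
    g_min = groups.min(axis=2, keepdims=True)
    g_max = groups.max(axis=2, keepdims=True)
    s_group = np.maximum((g_max - g_min) / 15.0, 1e-8)
    zeros = np.round(-g_min / s_group)
    w_uint4 = np.clip(np.round(groups / s_group) + zeros, 0, 15).astype(np.uint8)
    return w_uint4, s_group, zeros, s_channel

def dequant_to_int8(w_uint4, s_group, zeros):
    # What the GEMM kernel does in registers: UINT4 -> INT8 via a subtract and
    # a multiply, so the matrix multiply itself runs on INT8 tensor cores.
    v = (w_uint4.astype(np.float32) - zeros) * s_group
    return np.clip(np.round(v), -128, 127).astype(np.int8)
```

The key property is that level-two dequantization is an integer-range subtract-and-multiply, cheap enough to run inside the GEMM main loop; only the final per-channel rescale back to FP16 happens once per output.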
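SmoothAttention can likewise be sketched in a few lines. The per-channel factor λ = max|K|^α with α = 0.5 is an assumed migration strength following the SmoothQuant-style recipe; treat this as an illustration rather than the exact QServe implementation.

```python
import numpy as np

def smooth_attention(q: np.ndarray, k: np.ndarray, alpha: float = 0.5):
    """Migrate key-quantization difficulty to the (unquantized) queries.

    q, k: [num_tokens, head_dim]. Scaling keys down by a per-channel factor
    and queries up by the same factor leaves q @ k.T unchanged while
    flattening key outliers, so K becomes easy to quantize to 4 bits.
    """
    lam = np.maximum(np.abs(k).max(axis=0), 1e-5) ** alpha  # per-channel factor
    q_s, k_s = q * lam, k / lam
    assert np.allclose(q @ k.T, q_s @ k_s.T, atol=1e-3)     # scores preserved
    return q_s, k_s

rng = np.random.default_rng(0)
q, k = rng.standard_normal((16, 128)), rng.standard_normal((16, 128))
smooth_attention(q, k)
```

Because the scaling is fixed per channel, it can be folded offline into the query and key projection weights, so smoothing adds no runtime cost.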
These techniques yield significant serving-throughput improvements over state-of-the-art systems such as TensorRT-LLM. For example, QServe improves the maximum achievable serving throughput of Llama-3-8B by 1.2× on A100 and 1.4× on L40S, and of Qwen1.5-72B by 2.4× on A100 and 3.5× on L40S. Notably, QServe on an L40S can exceed the throughput of TensorRT-LLM on an A100, reducing the dollar cost of LLM serving by 3×.

QServe also optimizes attention computation. Decoding attention with a 4-bit KV cache (KV4 attention) is memory-bound, and the kernel design keeps attention within the memory-bound region, where low-bit quantization translates directly into higher throughput; a back-of-the-envelope check follows below. In addition, QServe's W4A8 GEMM kernel achieves a 1.5× speedup over the W8A8 cuBLAS GEMM. The system is validated on seven widely used LLMs on A100 and L40S GPUs, demonstrating significant improvements in throughput and accuracy compared to existing systems. Overall, QServe's co-design of quantization algorithm and serving system effectively reduces dequantization overhead and improves the efficiency of LLM serving.
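The memory-bound claim is easy to sanity-check. The configuration below (32 layers, 8 grouped-query KV heads, head dimension 128, roughly matching Llama-3-8B) and the batch and sequence sizes are assumptions chosen for illustration:

```python
def kv_bytes_per_decode_step(layers=32, kv_heads=8, head_dim=128,
                             seq_len=4096, batch=64, bits=16):
    """Bytes of KV cache read to decode one token (keys + values)."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bits // 8

for bits in (16, 8, 4):
    gb = kv_bytes_per_decode_step(bits=bits) / 1e9
    print(f"KV{bits}: {gb:5.1f} GB read per decoding step")
# KV16: 34.4 GB, KV8: 17.2 GB, KV4: 8.6 GB. Since decode attention performs
# very little compute per byte loaded, halving the KV bit width roughly
# halves attention latency -- as long as the kernel stays memory-bound.
```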