LLM-PQ is a system designed to improve the efficiency of large language model (LLM) serving on heterogeneous GPU clusters. It introduces adaptive model quantization and phase-aware partitioning to optimize inference throughput while meeting user-specified model quality targets. The system addresses the challenges of serving LLMs on clusters with a mix of high- and low-capacity GPUs, which are common in practical AI and machine learning environments. LLM-PQ jointly determines quantization precision, model layer partitioning, and micro-batch sizing strategies based on the LLM, available resources, and user requirements.
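To make the joint decision concrete, the sketch below shows one way to represent such a plan as a small data structure: per-layer bitwidths, a layer-to-GPU partition, and a micro-batch size. This is purely illustrative; the field names, GPU labels, and layer counts are assumptions, not the authors' actual implementation.

```python
# Minimal sketch (not the authors' code) of the joint plan LLM-PQ produces:
# per-layer quantization bitwidths, a layer-to-GPU partition, and a micro-batch size.
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class ServingPlan:
    # bitwidth chosen for each transformer layer, e.g. {0: 8, 1: 4, ...}
    layer_bitwidths: Dict[int, int]
    # contiguous layer range assigned to each GPU (names are illustrative)
    partition: Dict[str, Tuple[int, int]]
    # number of requests processed per micro-batch in the pipeline
    micro_batch_size: int

# Hypothetical plan for a 48-layer model split across one A100 and one V100.
plan = ServingPlan(
    layer_bitwidths={i: (8 if i < 24 else 4) for i in range(48)},
    partition={"A100-0": (0, 23), "V100-1": (24, 47)},
    micro_batch_size=4,
)
```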
The system uses a cost model to predict memory and latency requirements under mixed-precision quantization. It also introduces an indicator to measure the sensitivity of model layers to different quantization levels, enabling adaptive quantization that balances memory usage and model quality. LLM-PQ employs an iterative algorithm to explore possible GPU orderings and micro-batch sizes, then solves an integer linear programming problem to determine the optimal partition and quantization bitwidths.
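As a rough illustration of the memory side of such a cost model, the sketch below estimates per-layer memory from the chosen weight bitwidth plus the KV cache. The formulas and constants are simplifying assumptions for illustration, not the paper's profiled cost model.

```python
# Rough sketch of a mixed-precision memory estimate: weight memory shrinks with
# the chosen bitwidth, while the KV cache grows with batch size and sequence length.
# Constants and formulas below are illustrative assumptions, not LLM-PQ's exact model.
def weight_bytes(params_per_layer: int, bitwidth: int) -> float:
    """Approximate weight memory of one layer quantized to `bitwidth` bits."""
    return params_per_layer * bitwidth / 8

def kv_cache_bytes(batch: int, seq_len: int, hidden: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache memory of one layer (keys + values, fp16 by default)."""
    return 2 * batch * seq_len * hidden * bytes_per_elem

# Example: a layer with ~200M parameters at 16-, 8-, and 4-bit weights.
layer_params, hidden = 200_000_000, 4096
for bits in (16, 8, 4):
    total = weight_bytes(layer_params, bits) + kv_cache_bytes(batch=8, seq_len=2048, hidden=hidden)
    print(f"{bits}-bit layer: ~{total / 1e9:.2f} GB (weights + KV cache)")
```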
LLM-PQ is evaluated on 11 different clusters, demonstrating up to 2.88× throughput improvement over state-of-the-art approaches. The system achieves this by efficiently utilizing heterogeneous GPU resources and reducing memory waste, while still meeting the user-specified model quality target. It also supports phase-aware model partitioning, which is crucial for handling the two-phase inference process of LLMs (prefill and decode). The system is implemented with a distributed runtime that handles preprocessing, postprocessing, and micro-batch scheduling for the different generation phases.
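The toy calculation below illustrates why the two phases need separate treatment: prefill processes the whole prompt in one pass, while decode runs once per generated token, so a partition tuned for one phase can bottleneck the other. The linear latency model and per-GPU coefficients are invented for illustration and are not LLM-PQ's profiled costs.

```python
# Toy, phase-aware latency estimate (assumed linear model, not the paper's):
# the pipeline's speed is set by its slowest stage in each phase.
def stage_latency_ms(tokens_in_flight: int, layers: int, ms_per_layer_token: float) -> float:
    """Toy per-micro-batch latency of one pipeline stage."""
    return tokens_in_flight * layers * ms_per_layer_token

prompt_len, gen_len, micro_batch = 512, 128, 4
# (layers assigned, ms per layer-token) for each stage -- illustrative values only.
stages = {"A100-0": (24, 0.0005), "V100-1": (24, 0.0012)}

# Prefill pushes the whole prompt through once; decode pushes one token per step.
prefill = max(stage_latency_ms(micro_batch * prompt_len, l, c) for l, c in stages.values())
decode = gen_len * max(stage_latency_ms(micro_batch * 1, l, c) for l, c in stages.values())
print(f"prefill-bound time ~ {prefill:.1f} ms, decode-bound time ~ {decode:.1f} ms")
```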
LLM-PQ outperforms existing solutions in heterogeneous clusters by better utilizing memory and conducting phase-aware and precision-aware model partitions. It also performs well in homogeneous clusters, achieving throughput gains despite the smaller scale of those clusters. The system is designed to handle both online and offline serving tasks, with a focus on the offline task where prompt lengths and token generation numbers are known in advance. LLM-PQ's adaptive quantization and phase-aware partitioning strategies make it a promising solution for efficient LLM serving on heterogeneous GPU clusters.