The paper "LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization" addresses the challenge of efficiently serving large-scale language models (LLMs) on heterogeneous GPU clusters. LLMs, such as GPT3, LLaMA, OPT, and BLOOM, have demonstrated impressive performance but require significant computational resources, making their deployment costly. The authors propose LLM-PQ, a system that combines adaptive model quantization and phase-aware partitioning to optimize LLM serving efficiency on heterogeneous clusters.
Key contributions of LLM-PQ include:
1. **Cost Model**: A detailed model to predict memory usage and latency under mixed-precision quantization.
2. **Adaptive Mixed-Precision Quantization**: Chooses different quantization precisions for different layers based on each GPU's available memory and compute capability (see the sketch after this list).
3. **Phase-Aware Model Partitioning**: Partitions the model across GPUs while accounting for the two phases of LLM inference, the compute-intensive prefill over the full prompt and the memory-bound, token-by-token decode, so that resource utilization stays balanced in both.
4. **Micro-Batch Sizing**: Efficiently schedules micro-batches to balance execution times across phases.
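To make the precision and partition decisions concrete, below is a minimal Python sketch of the general idea: split layers across a heterogeneous pair of GPUs and greedily pick the highest weight precision that fits each GPU's memory budget. This is not the LLM-PQ optimizer itself (the paper drives its assignment with the cost model above and a joint search over precisions and partitions); the layer sizes, GPU memory figures, and greedy heuristic here are illustrative assumptions only.

```python
# Illustrative sketch of mixed-precision assignment on a heterogeneous cluster.
# NOT the LLM-PQ optimizer; shapes, memory budgets, and the heuristic are
# assumptions made up for this example.
from dataclasses import dataclass

BITS = [16, 8, 4]  # candidate weight precisions, highest quality first


@dataclass
class GPU:
    name: str
    mem_gb: float  # memory budget available for weights (activations ignored)


def weight_mem_gb(hidden: int, n_layers: int, bits: int) -> float:
    """Rough transformer weight size: ~12 * hidden^2 parameters per layer."""
    params_per_layer = 12 * hidden * hidden
    return n_layers * params_per_layer * bits / 8 / 1e9


def assign(gpus: list[GPU], n_layers: int, hidden: int) -> list[tuple[str, int, int]]:
    """Split layers evenly across GPUs, then pick the highest precision
    whose weights fit each GPU's memory budget."""
    per_gpu = [n_layers // len(gpus)] * len(gpus)
    for i in range(n_layers % len(gpus)):
        per_gpu[i] += 1

    plan = []
    for gpu, layers in zip(gpus, per_gpu):
        chosen = None
        for bits in BITS:  # prefer the highest precision that fits
            if weight_mem_gb(hidden, layers, bits) <= gpu.mem_gb:
                chosen = bits
                break
        if chosen is None:
            raise RuntimeError(f"{gpu.name} cannot hold {layers} layers even at {BITS[-1]}-bit")
        plan.append((gpu.name, layers, chosen))
    return plan


if __name__ == "__main__":
    # Hypothetical heterogeneous cluster: a 20 GB budget and a 10 GB budget GPU.
    cluster = [GPU("gpu0-large", mem_gb=20.0), GPU("gpu1-small", mem_gb=10.0)]
    # OPT-30B-like shape: 48 layers, hidden size 7168.
    for name, layers, bits in assign(cluster, n_layers=48, hidden=7168):
        print(f"{name}: {layers} layers @ {bits}-bit "
              f"({weight_mem_gb(7168, layers, bits):.1f} GB weights)")
```

In the real system, the cost model also predicts latency for both inference phases, so the chosen precisions and partitions trade model quality against end-to-end throughput rather than merely fitting within memory.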
Experiments on 11 different clusters show that LLM-PQ achieves up to 2.88× (average 2.26×) improvement in inference throughput compared to state-of-the-art methods, demonstrating its effectiveness in leveraging heterogeneous GPU clusters for efficient LLM serving. The source code is available at https://github.com/tonyzhao-jt/LLM-PQ.