9 Jan 2024 | Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, Yuxiong He
DeepSpeed-FastGen is a system designed to enhance the deployment and scaling of large language models (LLMs) by improving throughput and reducing latency. The system introduces Dynamic SplitFuse, a novel prompt and generation composition strategy, which achieves up to 2.3x higher effective throughput, 2x lower average latency, and up to 3.7x lower tail latency compared to state-of-the-art systems like vLLM. DeepSpeed-FastGen leverages the combination of DeepSpeed-MII and DeepSpeed-Inference to provide an efficient and user-friendly serving system for LLMs. It supports a range of models and offers both non-persistent and persistent deployment options. The paper presents a detailed benchmarking methodology, analyzes performance through latency-throughput curves, and investigates scalability via load balancing. Evaluations demonstrate significant improvements in throughput and latency across various models and hardware configurations. The authors also discuss future enhancements, including broader model support and new hardware backends. The DeepSpeed-FastGen code is available for community engagement and contribution.
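The core idea behind Dynamic SplitFuse is to run every forward pass at a fixed token budget: long prompts are split into chunks served across multiple passes, and short prompts and single-token decode steps are composed together to fill out the budget. The minimal sketch below illustrates that composition strategy only; the names (`Request`, `schedule_pass`) and the budget value are illustrative assumptions, not DeepSpeed-FastGen's actual API.

```python
# Illustrative sketch of the Dynamic SplitFuse scheduling idea (not the
# real DeepSpeed-FastGen implementation): fill a fixed per-pass token
# budget with decode steps first, then chunks split off pending prompts.
from dataclasses import dataclass

@dataclass
class Request:
    req_id: int
    prompt_remaining: int  # prompt tokens not yet processed (0 => decoding)

def schedule_pass(requests, token_budget=512):
    """Compose one forward pass under a fixed token budget.

    Returns a list of (req_id, tokens_this_pass) pairs.
    """
    batch = []
    # Requests in the decode phase contribute exactly one token each.
    for r in requests:
        if r.prompt_remaining == 0 and token_budget > 0:
            batch.append((r.req_id, 1))
            token_budget -= 1
    # Fill the remaining budget with prompt chunks; a long prompt is
    # split so it never overruns the budget, and is resumed next pass.
    for r in requests:
        if r.prompt_remaining > 0 and token_budget > 0:
            chunk = min(r.prompt_remaining, token_budget)
            r.prompt_remaining -= chunk
            batch.append((r.req_id, chunk))
            token_budget -= chunk
    return batch
```

Because every pass processes roughly the same number of tokens, the system runs at a consistent, high-utilization operating point, which is the mechanism behind the throughput and tail-latency gains summarized above.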