DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference

9 Jan 2024 | Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, Yuxiong He
DeepSpeed-FastGen is a high-throughput text generation system for large language models (LLMs) built on Dynamic SplitFuse, a novel prompt and generation composition strategy. It combines DeepSpeed-MII and DeepSpeed-Inference into an efficient, easy-to-use serving system that supports a range of models and offers both non-persistent and persistent deployment options for diverse user scenarios. Compared to state-of-the-art systems such as vLLM, it achieves up to 2.3x higher effective throughput, 2x lower latency on average, and up to 3.7x lower (token-level) tail latency.

The system is evaluated on a variety of models and hardware platforms, including A100, H100, and A6000 GPUs, demonstrating significant improvements in throughput and latency. The evaluation follows a detailed benchmarking methodology, analyzes performance through latency-throughput curves, and investigates scalability via load balancing across replicas.

To handle long prompts efficiently, Dynamic SplitFuse decomposes them into smaller chunks and schedules those chunks across multiple forward passes, composing them with tokens from ongoing generations so that each pass operates at a consistent, well-utilized size. This yields more consistent performance and lower latency, as sketched below.
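The actual scheduler is internal to DeepSpeed-Inference; the following is a minimal, hypothetical Python sketch of the core idea only. Each forward pass is filled up to a fixed token budget: ongoing generations contribute one token each, and long prompts are split into partial chunks that consume the remaining budget. All names and the budget value are illustrative, not the library's API.

```python
from dataclasses import dataclass

TOKEN_BUDGET = 2048  # illustrative per-forward-pass token budget (hypothetical value)

@dataclass
class Request:
    prompt_tokens: int       # prompt tokens not yet prefilled
    in_decode: bool = False  # True once the full prompt has been processed

def schedule_step(queue: list[Request]) -> list[tuple[Request, int]]:
    """Compose one forward pass in the spirit of Dynamic SplitFuse:
    decode tokens first, then prompt chunks split to fill the budget."""
    batch, budget = [], TOKEN_BUDGET
    # 1. Each ongoing generation contributes exactly one token.
    for req in queue:
        if req.in_decode and budget > 0:
            batch.append((req, 1))
            budget -= 1
    # 2. Fill the remaining budget with (possibly partial) prompt chunks.
    for req in queue:
        if not req.in_decode and budget > 0:
            chunk = min(req.prompt_tokens, budget)
            batch.append((req, chunk))
            req.prompt_tokens -= chunk
            budget -= chunk
            if req.prompt_tokens == 0:
                req.in_decode = True  # prompt fully consumed; switch to decoding
    return batch
```

Calling schedule_step repeatedly drains long prompts chunk by chunk while keeping every forward pass near the same token budget, which is what produces the consistent per-pass latency described above.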
DeepSpeed-FastGen is implemented as a synergistic composition of DeepSpeed-MII and DeepSpeed-Inference, which together provide the system's frontend APIs, host and device infrastructure, optimized kernel implementations, and tools for constructing new model implementations. It is available through an alpha release with support for models such as LLaMA, Mistral, and Facebook OPT, and offers two deployment options: an interactive non-persistent pipeline and a persistent serving deployment, both illustrated below. As part of the larger DeepSpeed ecosystem, the project encourages community contributions, collaborations, and feedback; the roadmap includes performance improvements, broader model support, and new hardware backends developed in collaboration with partners.
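The snippets below follow the usage pattern shown in the DeepSpeed-FastGen alpha announcement; the model name and exact argument names (e.g., max_new_tokens) are illustrative and may differ across MII versions.

```python
import mii

# Non-persistent, interactive pipeline: loads the model in-process,
# suitable for experimentation and short-lived workloads.
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
responses = pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
print(responses)
```

```python
import mii

# Persistent serving deployment: starts a long-lived server process,
# then connects a lightweight client that can be created from anywhere.
client = mii.serve("mistralai/Mistral-7B-v0.1")
responses = client.generate(["DeepSpeed is"], max_new_tokens=128)
print(responses)
client.terminate_server()  # shut down the persistent deployment when done
```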