Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache

4 Jul 2024 | Bin Lin*, Chen Zhang†, Tao Peng*, Hanyu Zhao, Wencong Xiao, Minmin Sun, Anmin Liu, Zhipeng Zhang, Lanbo Li, Xiafei Qiu, Shen Li, Zhigang Ji, Tao Xie, Yong Li, Wei Lin
Infinite-LLM is a novel LLM serving system designed to efficiently handle requests with dynamic context lengths. It disaggregates the attention layers from the rest of the LLM model, enabling flexible and independent resource scheduling that optimizes computational performance and enhances memory utilization. By pooling GPU memory across the cluster, Infinite-LLM significantly boosts system throughput and supports extensive context lengths. Evaluated on workloads with context lengths ranging from a few tokens to 2,000K tokens on a cluster of 32 A100 GPUs, Infinite-LLM demonstrates throughput improvements of 1.35-3.4x over state-of-the-art methods, enabling efficient and elastic LLM deployment.

The key challenges in LLM serving stem from the significant differences between attention and non-attention layers: the resource demands of attention layers are dynamic, growing with context length, and are largely insensitive to batch size, whereas non-attention layers have static per-token costs and are sensitive to batch size. To address this, Infinite-LLM introduces DistAttention, a novel attention mechanism that allows attention computation and the KVCache to be flexibly disaggregated and distributed across instances. DistAttention is mathematically equivalent to the original attention but avoids performing the softmax max and summation over the entire sequence in one place; each instance executes these operations locally on its partial KVCache data, and the partial results are then aggregated.
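As a concrete illustration of this decomposition, below is a minimal NumPy sketch of block-wise softmax aggregation for a single query vector. The function names and the two-phase (local statistics, then merge) structure are illustrative assumptions, not the paper's actual kernels.

```python
import numpy as np

def partial_attention(q, k_block, v_block):
    """Attention statistics for one KVCache partition held by one instance.

    Returns the local max, the local exp-sum, and the unnormalized
    weighted value sum, so no global softmax is needed at this stage.
    """
    scores = k_block @ q / np.sqrt(q.shape[-1])   # (block_len,)
    m = scores.max()                              # local max, for numerical stability
    w = np.exp(scores - m)                        # local unnormalized weights
    return m, w.sum(), w @ v_block                # (m_i, s_i, o_i)

def merge_partials(partials):
    """Aggregate per-partition statistics into the exact attention output."""
    m_global = max(m for m, _, _ in partials)
    s_global = sum(s * np.exp(m - m_global) for m, s, _ in partials)
    o_global = sum(o * np.exp(m - m_global) for m, _, o in partials)
    return o_global / s_global

# Example: splitting the KVCache into two partitions gives the same result
# as standard attention over the full sequence.
rng = np.random.default_rng(0)
d, n = 64, 1000
q, k, v = rng.normal(size=d), rng.normal(size=(n, d)), rng.normal(size=(n, d))

partials = [partial_attention(q, k[:600], v[:600]),
            partial_attention(q, k[600:], v[600:])]
dist_out = merge_partials(partials)

scores = k @ q / np.sqrt(d)
weights = np.exp(scores - scores.max())
ref_out = weights @ v / weights.sum()
assert np.allclose(dist_out, ref_out)
```

Note that only the scalar statistics and one output vector per partition need to be exchanged, so the full KVCache never has to be gathered on a single instance.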
Infinite-LLM also introduces a centralized controller, gManager, which hosts the scheduling policy and coordinates dynamic inter-instance KVCache tracking and migration. By optimizing resource allocation across the cluster, the system achieves significant improvements in resource efficiency and throughput, outperforming state-of-the-art methods in both throughput and resource utilization. In end-to-end evaluations, Infinite-LLM serves contexts of up to 2,000K tokens with 32 GPUs and delivers a 1.35-3.4x performance improvement over existing methods.
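The summary above only says that gManager tracks KVCache usage across instances and coordinates migration. Purely as a hypothetical illustration of such a controller (not the paper's actual design), the sketch below keeps a registry of per-instance free KVCache blocks and picks a host instance when one instance overflows; all class and method names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InstanceStats:
    """Memory report an instance might periodically send to the controller."""
    instance_id: str
    free_kv_blocks: int      # KVCache blocks currently available locally

class GManagerSketch:
    """Toy centralized registry: tracks per-instance KVCache capacity and,
    when an instance runs out of local memory, picks another instance to
    host the overflowing KVCache blocks (hypothetical policy)."""

    def __init__(self) -> None:
        self.stats: dict[str, InstanceStats] = {}

    def report(self, s: InstanceStats) -> None:
        self.stats[s.instance_id] = s

    def pick_host(self, requester_id: str, blocks_needed: int) -> Optional[str]:
        # Choose the instance with the most spare KVCache capacity,
        # excluding the requester itself.
        candidates = [s for s in self.stats.values()
                      if s.instance_id != requester_id
                      and s.free_kv_blocks >= blocks_needed]
        if not candidates:
            return None
        return max(candidates, key=lambda s: s.free_kv_blocks).instance_id
```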