4 Jul 2024 | Bin Lin*, Chen Zhang*, Tao Peng*, Hanyu Zhao, Wencong Xiao, Minmin Sun, Anmin Liu, Zhipeng Zhang, Lanbo Li, Xiafei Qiu, Shen Li, Zhigang Ji, Tao Xie, Yong Li, Wei Lin
Infinite-LLM is a novel LLM serving system designed to efficiently handle requests with dynamic context lengths. It addresses a challenge posed by the autoregressive nature of LLMs: the attention layers exhibit highly dynamic behavior, with computational characteristics and memory requirements that differ significantly from those of the non-attention layers. Traditional static model parallelism and resource allocation strategies fall short in managing these dynamic demands. Infinite-LLM introduces DistAttention, a distributed attention mechanism that allows attention resources to be scheduled flexibly and independently of the non-attention layers, optimizing computational performance and enhancing memory utilization. By pooling GPU memory across a cluster, Infinite-LLM significantly boosts system throughput and supports extensive context lengths.

In evaluations on a dataset with context lengths ranging from a few tokens to 2,000K tokens, run on a cluster of 32 A100 GPUs, Infinite-LLM improves throughput by 1.35-3.4x over state-of-the-art methods, enabling efficient and elastic LLM deployment. Its key components are DistAttention itself, a greedy scheduling policy, and a centralized controller for dynamic inter-instance KVCache tracking and migration.
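The summary does not detail how attention can be computed over a KV cache spread across multiple GPUs, but the underlying arithmetic is standard for any distributed or blocked attention scheme: each partition computes a partial attention over its local keys and values, and the partial results are merged exactly via log-sum-exp rescaling. The NumPy sketch below illustrates that merging step; the function names (partial_attention, merge_partials) and the single-query setup are illustrative assumptions, not the paper's actual API.

```python
import numpy as np

def partial_attention(q, K, V):
    """Attention over one KV-cache partition (e.g., the blocks held by one GPU).
    Returns the statistics needed for exact merging: the local max score m,
    the local softmax denominator s, and the unnormalized value sum o."""
    scores = K @ q / np.sqrt(q.shape[-1])  # (n_i,)
    m = scores.max()
    p = np.exp(scores - m)                 # shift by local max for stability
    return m, p.sum(), p @ V               # (m_i, s_i, o_i)

def merge_partials(partials):
    """Combine per-partition results into the exact global attention output
    via log-sum-exp rescaling; identical to softmax over all keys at once."""
    m_g = max(m for m, _, _ in partials)
    s_g = sum(np.exp(m - m_g) * s for m, s, _ in partials)
    o_g = sum(np.exp(m - m_g) * o for m, _, o in partials)
    return o_g / s_g

# Sanity check: split one KV cache into three "remote" partitions and
# compare against monolithic attention computed in one shot.
rng = np.random.default_rng(0)
d, n = 64, 1024
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

parts = [partial_attention(q, K[s], V[s])
         for s in (slice(0, 300), slice(300, 700), slice(700, n))]

scores = K @ q / np.sqrt(d)
w = np.exp(scores - scores.max())
reference = (w @ V) / w.sum()
assert np.allclose(merge_partials(parts), reference)
```

Because the merge is exact, partitions can live on different GPUs, or be migrated between instances as Infinite-LLM's centralized controller does with KVCache blocks, without changing the attention output.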