LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism

15 Apr 2024 | Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, Xin Jin
LoongServe is an efficient LLM serving system that addresses the challenges of serving long-context LLMs by introducing Elastic Sequence Parallelism (ESP). Static parallelism strategies are inefficient for workloads whose request lengths vary widely and whose prefill and decoding phases have very different resource demands. ESP instead adjusts the degree of parallelism (DoP) dynamically at runtime, improving computation, communication, and GPU memory efficiency. On real-world datasets, LoongServe achieves up to 3.85× higher throughput than chunked prefill and up to 5.81× higher throughput than prefill-decoding disaggregation.

ESP scales parallelism elastically to match request lengths and phases. For the prefill phase, proactive scaling-down reuses communication already performed during prefill to reduce scaling overhead. For the decoding phase, multi-master decoding avoids key-value (KV) cache migration and overlaps communication with computation. Together, these mechanisms eliminate GPU memory fragmentation and enable token-level KV cache allocation.

LoongServe's global manager dynamically adjusts the DoP, batching, and KV cache placement based on real-time profiling. It makes decisions at the iteration level with a scalable four-step scheduling algorithm of polynomial complexity, and it supports elastic scaling up and down without additional overhead, ensuring efficient resource utilization. The global manager coordinates requests, elastic instances, and the distributed KV cache pool.
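
To make iteration-level DoP selection concrete, here is a minimal Python sketch assuming a simple length-based heuristic: long prompts spread across more elastic instances during prefill, and requests drop to a small DoP for decoding. All names (Request, choose_dop) and the tokens_per_instance threshold are illustrative assumptions, not LoongServe's actual API.

```python
# Hypothetical sketch of iteration-level DoP selection under ESP.
# Request, choose_dop, and tokens_per_instance are illustrative, not
# LoongServe's actual API.
from dataclasses import dataclass


@dataclass
class Request:
    req_id: int
    prompt_len: int         # tokens to prefill
    phase: str = "prefill"  # "prefill" or "decoding"


def choose_dop(req: Request, free_instances: int,
               tokens_per_instance: int = 16384) -> int:
    """Pick a degree of parallelism for the next iteration.

    Prefill cost grows with prompt length, so long prompts get more
    instances; decoding produces one token per iteration, so it runs at
    a small DoP after proactively scaling down. The threshold is an
    assumed tuning knob, not a value from the paper.
    """
    if req.phase == "prefill":
        needed = -(-req.prompt_len // tokens_per_instance)  # ceil division
        return max(1, min(needed, free_instances))
    return 1  # decoding: scale down and free instances for other requests


if __name__ == "__main__":
    print(choose_dop(Request(0, prompt_len=2_000), free_instances=8))    # 1
    print(choose_dop(Request(1, prompt_len=200_000), free_instances=8))  # 8
```

In a real scheduler this choice would be revisited every iteration as instances free up, which is what makes the parallelism elastic rather than static.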
It optimizes scheduling under GPU compute and memory constraints, balancing the prefill and decoding phases, and uses dynamic programming to choose batching strategies that minimize prefill latency. LoongServe is implemented in C++, CUDA, Python, and OpenAI Triton, reusing components from LightLLM and vLLM. It supports multiple dynamic parallel groups and uses NCCL for communication. The system is compatible with various attention mechanisms and, evaluated on real-world workloads, outperforms state-of-the-art serving systems.
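
As an illustration of the batching step, here is a minimal dynamic-programming sketch that partitions an ordered queue of prefill requests into contiguous batches minimizing a summed latency cost. The quadratic cost model and all names are assumptions for illustration, not the paper's actual algorithm or cost model.

```python
# Hypothetical dynamic-programming batch planner for queued prefill
# requests. The cost model and all names are assumptions for
# illustration, not LoongServe's actual scheduler.

def batch_cost(token_counts: list[int]) -> float:
    """Assumed latency model: a fixed per-batch overhead plus a term that
    is quadratic in the batch's total tokens (attention over the batch)."""
    total = sum(token_counts)
    return 5.0 + 1e-6 * total * total


def plan_batches(token_counts: list[int]) -> list[list[int]]:
    """Split an ordered queue of prefill requests into contiguous batches
    minimizing the summed batch cost, via an O(n^2) DP."""
    n = len(token_counts)
    best = [0.0] + [float("inf")] * n  # best[i]: min cost of first i requests
    cut = [0] * (n + 1)                # cut[i]: start of the batch ending at i
    for i in range(1, n + 1):
        for j in range(i):
            cost = best[j] + batch_cost(token_counts[j:i])
            if cost < best[i]:
                best[i], cut[i] = cost, j
    batches, i = [], n
    while i > 0:                       # walk the cuts back to recover batches
        batches.append(token_counts[cut[i]:i])
        i = cut[i]
    return batches[::-1]


if __name__ == "__main__":
    # Adjacent short prompts get merged; long prompts run in their own batch:
    # [[1000, 1200], [64000], [500], [128000]]
    print(plan_batches([1_000, 1_200, 64_000, 500, 128_000]))
```

The per-batch overhead makes it worthwhile to merge short prompts, while the quadratic term keeps very long prompts from being batched with anything else.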
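Finally, the token-level KV cache allocation mentioned above can be pictured as a shared pool of per-token slots. Below is a minimal sketch assuming a simple free-list design; TokenKVPool and all names are hypothetical, not LoongServe's implementation.

```python
# Hypothetical sketch of a token-granularity KV cache pool, assuming a
# simple free-list of per-token slots. TokenKVPool and its methods are
# illustrative, not LoongServe's implementation.

class TokenKVPool:
    """Allocates KV cache space one token slot at a time, so any free slot
    can serve any request: no contiguous per-request region is reserved,
    which avoids fragmentation as requests of varying lengths come and go."""

    def __init__(self, num_slots: int):
        self.free = list(range(num_slots))     # indices of free token slots
        self.owner: dict[int, list[int]] = {}  # req_id -> slots it holds

    def alloc(self, req_id: int, num_tokens: int) -> list[int]:
        if num_tokens > len(self.free):
            raise MemoryError("KV cache pool exhausted")
        slots = [self.free.pop() for _ in range(num_tokens)]
        self.owner.setdefault(req_id, []).extend(slots)
        return slots

    def release(self, req_id: int) -> None:
        self.free.extend(self.owner.pop(req_id, []))


if __name__ == "__main__":
    pool = TokenKVPool(num_slots=1 << 20)
    pool.alloc(req_id=0, num_tokens=4_096)  # prefill: one slot per prompt token
    pool.alloc(req_id=0, num_tokens=1)      # decoding: one slot per new token
    pool.release(0)                         # all slots return to the pool
```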