26 Jun 2024 | Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan
MemServe is a system that unifies inter-request and intra-request optimizations for large language model (LLM) serving. It introduces MemPool, an elastic memory pool that manages distributed memory and KV caches across serving instances. MemPool provides APIs for memory allocation, index management, and distributed transfer, enabling context caching and disaggregated inference on a single platform; the same abstraction also supports intra-request optimizations such as sequence parallelism. On top of MemPool, MemServe runs a global scheduler with a locality-aware policy based on global prompt trees to maximize KV cache reuse. By combining context caching with disaggregated inference, the system improves job completion time (JCT) and time-to-first-token (TTFT). The design also addresses the challenges of managing KV caches across distributed instances, including memory layout and network transfer optimizations.

MemServe is implemented with MemPool and the global scheduler, and it is evaluated in four settings: PD-colocated, PD-colocated with caching, PD-disaggregated, and PD-disaggregated with caching. Results show that MemServe significantly improves JCT and TTFT compared to vanilla vLLM.
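The paper's summary names three MemPool API families (memory allocation, index management, distributed transfer) without showing code. As a rough illustration only, the sketch below shows what such an interface might look like; all class and method names are hypothetical, strings stand in for KV tensors, and an in-process copy stands in for network transfer.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CacheHandle:
    """Opaque reference to a KV-cache block on some instance (hypothetical)."""
    instance_id: str
    block_id: int


@dataclass
class MemPoolInstance:
    """Sketch of one serving instance's view of an elastic memory pool."""
    instance_id: str
    next_block: int = 0
    index: dict = field(default_factory=dict)   # prefix key -> CacheHandle
    blocks: dict = field(default_factory=dict)  # block_id -> payload (stand-in for KV data)

    # --- memory allocation API ---
    def alloc(self, payload) -> CacheHandle:
        handle = CacheHandle(self.instance_id, self.next_block)
        self.blocks[handle.block_id] = payload
        self.next_block += 1
        return handle

    # --- index management API: map a token prefix to a cached block ---
    def insert_index(self, token_prefix: tuple, handle: CacheHandle) -> None:
        self.index[self._key(token_prefix)] = handle

    def lookup(self, token_prefix: tuple):
        return self.index.get(self._key(token_prefix))

    # --- distributed transfer API (stand-in: direct copy instead of RDMA/network) ---
    def transfer_to(self, handle: CacheHandle, dst: "MemPoolInstance",
                    token_prefix: tuple) -> CacheHandle:
        new_handle = dst.alloc(self.blocks[handle.block_id])
        dst.insert_index(token_prefix, new_handle)
        return new_handle

    @staticmethod
    def _key(token_prefix: tuple) -> str:
        return hashlib.sha256(repr(token_prefix).encode()).hexdigest()


# Disaggregated-inference flavor: a prefill instance produces KV state,
# indexes it, then ships it to a decode instance.
prefill = MemPoolInstance("prefill-0")
decode = MemPoolInstance("decode-0")
h = prefill.alloc("kv-for-prompt")
prefill.insert_index((1, 2, 3), h)
moved = prefill.transfer_to(h, decode, (1, 2, 3))
```

The point of the sketch is the separation of concerns: allocation and indexing are local, while transfer moves both the data and its index entry, so later requests can find the cache on the destination instance.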
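The locality-aware policy over global prompt trees can be pictured as routing each request to the instance that already caches the longest prefix of its prompt. The sketch below is a minimal, hypothetical version of that idea (the names `GlobalPromptTree`, `record`, and `best_instance` are not from the paper), using a trie over token IDs.

```python
class PromptTreeNode:
    """One node per token in the global prompt tree."""
    def __init__(self):
        self.children = {}
        self.instances = set()  # instances holding KV cache for the prefix ending here


class GlobalPromptTree:
    """Sketch of locality-aware routing over a global prompt tree (hypothetical API)."""

    def __init__(self):
        self.root = PromptTreeNode()

    def record(self, tokens, instance_id):
        """Register that `instance_id` caches KV state for every prefix of `tokens`."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PromptTreeNode())
            node.instances.add(instance_id)

    def best_instance(self, tokens, all_instances):
        """Pick the instance caching the longest prefix of `tokens`.

        Returns (instance_id, matched_prefix_length); falls back to an
        arbitrary (here: lexicographically first) instance on no match.
        """
        node, best, matched = self.root, None, 0
        for tok in tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            if node.instances:
                best = min(node.instances)
                matched += 1
        return (best if best is not None else min(all_instances), matched)
```

A real scheduler would also weigh load and memory pressure against locality; the trie alone shows why reuse is maximized, since a hit on any cached prefix lets the chosen instance skip recomputing that prefix's KV cache.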