MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool


26 Jun 2024 | Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan
MemServe is a unified system designed to enhance the efficiency of large language model (LLM) serving by integrating inter-request and intra-request optimizations. The core of MemServe is an elastic memory pool, MemPool, which manages distributed memory and KV caches across serving instances. MemPool offers a rich set of APIs for memory allocation, indexing, and distributed data transfer, enabling context caching and disaggregated inference. MemServe combines context caching with disaggregated inference for the first time, supported by a global scheduler that enhances cache reuse through a global prompt tree-based locality-aware policy. Tests show that MemServe significantly improves job completion time and time-to-first-token, demonstrating its effectiveness in improving LLM serving performance.
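
The abstract describes the scheduler's policy concretely enough to sketch. The Python below is a hedged illustration under assumptions, not MemServe's actual implementation or API: all names (PromptTreeNode, GlobalScheduler, route, and the least-loaded tie-break) are hypothetical. It keeps a trie over prompt token IDs, records which serving instance holds the KV cache for each prefix, and routes each new request to an instance with the longest cached prefix, falling back to least-loaded routing on a miss.

```python
# Minimal sketch of a prompt-tree-based, locality-aware routing policy.
# Hypothetical names throughout; the real MemServe additionally manages
# distributed KV-cache memory and prefill/decode disaggregation via MemPool.
from collections import defaultdict


class PromptTreeNode:
    """Trie over prompt token IDs; each node tracks instances caching that prefix."""

    def __init__(self):
        self.children = {}       # token ID -> child PromptTreeNode
        self.instances = set()   # serving instances holding KV cache for this prefix


class GlobalScheduler:
    """Routes requests toward the instance with the longest cached prompt prefix."""

    def __init__(self, instances):
        self.root = PromptTreeNode()
        self.instances = list(instances)
        self.load = defaultdict(int)  # outstanding requests per instance

    def _longest_cached_prefix(self, tokens):
        """Walk the tree; return the instances caching the longest matched prefix."""
        node, holders = self.root, set()
        for token in tokens:
            child = node.children.get(token)
            if child is None:
                break
            node = child
            if node.instances:
                holders = node.instances
        return holders

    def _record_prefix(self, tokens, instance):
        """After prefill, the chosen instance's KV cache covers the full prompt."""
        node = self.root
        for token in tokens:
            node = node.children.setdefault(token, PromptTreeNode())
            node.instances.add(instance)

    def route(self, prompt_tokens):
        holders = self._longest_cached_prefix(prompt_tokens)
        # Locality-aware: among instances with the longest cache hit, pick the
        # least loaded; with no hit, fall back to plain least-loaded routing.
        candidates = holders or self.instances
        target = min(candidates, key=lambda inst: self.load[inst])
        self.load[target] += 1
        self._record_prefix(prompt_tokens, target)
        return target


if __name__ == "__main__":
    sched = GlobalScheduler(["prefill-0", "prefill-1"])
    print(sched.route([1, 2, 3, 4]))  # cold start: least-loaded instance
    print(sched.route([1, 2, 3, 9]))  # reuses the cached [1, 2, 3] prefix
```

The trie makes prefix reuse cheap to detect: a lookup costs one step per prompt token, and routing to a prefix holder lets that instance skip recomputing the shared portion of the KV cache during prefill, which is what drives the time-to-first-token improvement the abstract reports.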